High Performance Computing (HPC) is a response to the inadequacy of conventional computing for the solution of the largest and most complex problems. It’s been said for over three decades that the highest performance computer is one that’s only an order of magnitude slower than is needed. But why is this the case?
Conventional Computing Systems
Conventional computing involves having a single central processing unit (CPU) executing a single program or instruction stream, and reading and writing to a single hierarchy of storage units. The execution performance of the program is limited by the architecture of the computer and its components.
The highest rate of operations in a CPU is that of its master clock oscillator; this is the rate of fundamental indivisible logical operations within the CPU. No single instruction can be executed faster than the time of a single clock pulse, and most take several “clocks” to complete. These values are results of the design of the CPU, and can vary even among a single family of processors.
Within any CPU, there is a hierarchy of circuits and operation speeds. Next to the clock rate is usually the register-to-register transfer rate. The registers are small, fast storage devices that hold single data items before and after they are operated on. For instance, two numbers to be added and their sum may occupy three registers. Registers are explicitly addressed in program code. In some computers, a transfer from, say, the adder to the sum register may happen in a single clock; however, machines with larger and more complex register systems may require several clocks to complete such a move.
Next slower in the CPU’s speed hierarchy is usually the memory subsystem. Some of this may be on the CPU chip, while other parts are off-chip. The on-chip portions are usually the “cache”, of which there may be several; cache is not explicitly used by the executing program, but is automatically managed by the CPU hardware in order to increase the average speed of memory operations. Cache accesses typically take from a few to ten or more clocks.
Regular memory, known as Random Access Memory (RAM), is the semiconductor storage from which programs execute; accessing a word of RAM typically takes from tens to hundreds of clock pulses. The larger the memory, the longer the access time.
Thus, the speed of a conventional computer is limited by the rate at which instructions and data can be moved between RAM and the CPU. Parts of the CPU-memory system can range in speed over two to three orders of magnitude. A very fast modern processor with a clock frequency of 3 GHz has a pulse time of about 0.333 nanoseconds. Execution events between it and its RAM may therefore take from 0.3 to 100 nanoseconds. The longer operations typically involve data movements off the chip, such as between CPU and RAM. Such movements are limited by the physics of the propagation of electrical signals.
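The arithmetic behind these figures can be sketched in a few lines of Python. The function names are illustrative only, and the 100 nanosecond RAM access is simply the upper figure quoted above, not a measurement of any particular machine.

```python
def clock_period_ns(frequency_ghz: float) -> float:
    """Return the duration of one clock pulse in nanoseconds."""
    # 1 GHz is one pulse per nanosecond, so the period is the reciprocal.
    return 1.0 / frequency_ghz


def clocks_elapsed(latency_ns: float, frequency_ghz: float) -> float:
    """Return how many clock pulses pass while waiting latency_ns."""
    return latency_ns * frequency_ghz


if __name__ == "__main__":
    print(f"3 GHz clock period: {clock_period_ns(3.0):.3f} ns")
    print(f"Clocks spent on a 100 ns RAM access: {clocks_elapsed(100.0, 3.0):.0f}")
```

At 3 GHz, a 100 nanosecond access costs 300 clocks, which is the "hundreds of clock pulses" end of the range given for RAM.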
Since there is always more than a single program and data set to be executed, there must be other elements of a computer. The other data, applications, and operating system reside on much larger, slower, and cheaper storage devices that are both external to the CPU and “non-volatile”, meaning information stored on them is preserved when they are powered off. The most familiar such device is the hard disk drive, an assembly of magnetic platters, read-write heads, and control-interface circuitry. The cost of disk storage has sunk to under $1 per gigabyte, while that of RAM may be 100 times as much, and that of fast on-chip cache may be ten times more expensive yet. This is why there is always much less cache than RAM and much less RAM than disk space.
An alternative to disk storage – Solid State Disk (SSD) – is becoming commercially available. It offers speed advantages, but at greater cost and lower bit density. For usage purposes, SSDs can be considered functionally equivalent to disk drives. They can be used effectively in contexts where many small I/O operations occur.
Fundamentally, the hierarchy of speeds and capacities must be balanced by system designers. The faster a component is, the more expensive it is, and therefore the lower its capacity will be. At the other end of the spectrum, the largest capacity devices, which can be made at least cost, will be the slowest ones.
Alternatives to Conventional Computing Systems
The path to higher performance computing is to use more CPUs to execute the program. Over the past three decades, this basic strategy has proved successful, in many incarnations. Today it is one level in a larger hierarchy of architectural techniques that increase performance. Before it began, replication of functional units within the CPU had already succeeded. Multiple adders, multipliers, and other logic were succeeded by vector-pipelined function units, which required changes in programming styles to exploit them.
The simplest multi-processor computing uses a small number of CPUs, typically around ten, within a single computer. The CPUs are on separate chips, and all of them usually share access to a single RAM subsystem, though the inclusion of a cache hierarchy complicates the picture. Each CPU may have its own on-chip cache memory, and other levels may be shared among the CPUs. The competition to access shared RAM is mitigated by the individual cache units but complicated by the shared cache. This sharing is sometimes a greater disadvantage to programmers than the performance improvement it offers, due to changes in programming schemes needed to exploit it. Programmers must use “threads” of execution, which are short sequences of instructions that can execute independently, and code segments in which conflicts can occur must be “serialized” for correctness. Language translators can usually generate threads from source programs, but they must be cautious to avoid pathologies in doing so.
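As a minimal sketch of what “serialized” means in practice, the following Python example has several threads update one shared counter, with a lock protecting the conflicting code segment. The names add_many and run_threads are hypothetical, chosen for illustration. In standard CPython the global interpreter lock limits parallel speedup for code like this, so the sketch illustrates correctness rather than performance.

```python
import threading


def add_many(counter, lock, n):
    """Increment the shared counter n times, one thread's share of the work."""
    for _ in range(n):
        # The read-modify-write below is the segment where threads can
        # conflict, so it is serialized: only one thread holds the lock.
        with lock:
            counter["value"] += 1


def run_threads(num_threads=4, increments=10000):
    """Run num_threads threads against one shared counter; return the total."""
    counter = {"value": 0}  # stands in for data held in shared RAM
    lock = threading.Lock()
    threads = [
        threading.Thread(target=add_many, args=(counter, lock, increments))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_threads())
```

With the lock in place the result is always num_threads times increments. Removing the lock makes the update unsafe, and increments may then be lost.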
A variation on shared memory systems is the multi-core system. It may contain one or more chips in which the “core” of the CPU is replicated on a single chip. This core will contain registers, function units, and possibly some cache. Multi-core systems are becoming so common that even laptops contain them, and application programmers now routinely program for multi-threaded execution. Given the prevalence of multiple cores in modern CPU chips, the term “socket” is often used to refer to a CPU chip or the space it occupies on a processor circuit board.
It is now common for desktop and server systems to contain multiple multi-core chips per circuit board, but practical limits keep them at small scale – a few cores per socket and a few sockets per board. The next step is typically referred to as “distributed memory”, since it addresses the limitations of effective memory sharing.
A distributed (once termed “multi-computer”) system contains numerous, essentially independent computers, often organized as “blades” (thin chassis) in a rack, each of which is a multi-socket system, perhaps accompanied by disks. In executing on a distributed memory system, the program is partitioned for execution on the individual computers. The partitions communicate to share data via messages sent over a special “interconnect” or network subsystem, which establishes a new level in the hierarchy of speeds and costs. They may now be so physically separated that signal propagation delays among them render tight synchronization impossible. Each system has its own master clock, and only software synchronization exists among them.
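The partition-and-message pattern can be sketched on a single machine, with threads standing in for nodes and a queue standing in for the interconnect. This is only an illustration of the programming model under those assumptions: real distributed systems use a message-passing library over a network, and the names node and distributed_sum are hypothetical.

```python
import queue
import threading


def node(rank, partition, interconnect):
    """One 'node': compute on its own partition only, then send one message."""
    partial = sum(partition)
    interconnect.put((rank, partial))  # the message: (sender, payload)


def distributed_sum(data, num_nodes):
    """Partition data across num_nodes workers and combine their messages."""
    interconnect = queue.Queue()  # stand-in for the interconnect network
    chunk = (len(data) + num_nodes - 1) // num_nodes  # ceiling division
    partitions = [data[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]
    workers = [
        threading.Thread(target=node, args=(rank, part, interconnect))
        for rank, part in enumerate(partitions)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total = 0
    for _ in range(num_nodes):
        _rank, partial = interconnect.get()  # messages arrive in any order
        total += partial
    return total


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101)), 4))
```

No worker reads another worker's partition. The only shared element is the message channel, which mirrors the absence of shared memory among nodes.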
The term “node” is commonly used to refer to the unit of replication of a distributed memory system. The definitive characteristic of a node seems to be that it contains its own copy of the operating system.
Physical separation affects communication rates. The interconnection networks typically communicate at reasonable speeds, but the end-to-end transfer time (the latency) for the first bit of a message is relatively large – on the order of a microsecond, or 1000 nanoseconds. This requires careful thought in partitioning problems, and a new concern arises: granularity. This is the ratio of computation to communication rates, or the amount of computation that can occur while a message is transferred: it may be one to ten thousand instructions (or more). Overlapping and amortizing communication for good computational efficiency is nontrivial, and a major contributor to performance (or lack thereof).
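A rough worked example of granularity follows. The one-instruction-per-clock rate and the no-overlap efficiency model are simplifying assumptions made here for illustration. The 3 GHz clock and the 1000 nanosecond latency are the figures used in the text.

```python
def instructions_per_latency(latency_ns, clock_ghz, instructions_per_clock=1.0):
    """Instructions a CPU could execute while one message is in flight.

    Assumption: a steady rate of instructions_per_clock instructions per clock.
    """
    return latency_ns * clock_ghz * instructions_per_clock


def efficiency(compute_ns, communicate_ns):
    """Fraction of time spent computing when communication is not overlapped."""
    return compute_ns / (compute_ns + communicate_ns)


if __name__ == "__main__":
    # A 1000 ns message latency on a 3 GHz processor.
    print(instructions_per_latency(1000.0, 3.0))
    # 9000 ns of computation for every 1000 ns of communication.
    print(efficiency(9000.0, 1000.0))
```

Under these assumptions one message latency costs about 3000 instructions, inside the range of one to ten thousand quoted above. A partition needs several times that much computation per message before the efficiency approaches 1.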
Distributed memory systems scale much larger than shared memory systems. The latter tend to top out at a few tens of processors, while the former have already reached hundreds of thousands of nodes. Such huge, expensive super-systems can occupy entire buildings and require more electrical power to keep them cool than to operate them. They can be difficult to manage, since not all programs or problems may require all available resources, and their cost mandates that no part of them be left idle for very long; they often must be partitioned for simultaneous execution of multiple programs (“space sharing”). Additionally, the mean time between failures for individual components is such that, in such large collections, some component may fail at any time, requiring attention to fault tolerance in both system and user software.
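The effect of scale on reliability can be estimated with a simple model. The sketch below assumes that components fail independently and at a constant rate, in which case the expected time to the first failure anywhere in the system is the component mean time between failures divided by the number of components. The five-year component figure is hypothetical, chosen only to show the scale of the effect.

```python
def system_mtbf_hours(component_mtbf_hours, num_components):
    """Expected hours until the first failure anywhere in the system.

    Assumption: independent components, each failing at a constant rate.
    """
    return component_mtbf_hours / num_components


if __name__ == "__main__":
    five_years_hours = 5 * 365 * 24  # hypothetical per-node MTBF
    hours = system_mtbf_hours(five_years_hours, 100_000)
    print(f"Expected time to first failure: {hours * 60:.0f} minutes")
```

Even with nodes that individually run for years, a system of a hundred thousand of them would see a failure somewhere well within an hour under this model, which is why fault tolerance cannot be left to hardware alone.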
The upside of these difficulties is that the hardest problems we currently know how to attack may now be solved, giving us intellectual tools we could only dream about a few decades ago.
The next step in the trend toward larger and more distributed systems is termed “the cloud”. This concept is based on the fact that many very large collections of distributed systems exist, not all of which are kept fully occupied at all times. Similarly to the practice of thirty years ago when banks would rent time on their data processing systems, it is now possible to use computers that are geographically distributed and separately owned and managed, to accomplish large and long-running computations.
Many of the computers so used are intended for different purposes than those they may be “rented out” for. For instance, large numbers of individual private game computers are routinely used for protein modeling and signal processing tasks. The common feature of these applications is that they are easily partitioned into variable-sized sub-problems that can be packaged, delivered to computing resources, processed, and the results returned, all across the internet. The internet, as seen by most private individuals and many commercial firms, is but a slow, cheap, ubiquitous level in the speed-cost-size hierarchy.
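The package, deliver, process, and return cycle might look like the following sketch. Everything here is hypothetical: the function names, the use of a sum of squares as the "computation", and the reversed processing order that stands in for results arriving out of order from remote machines. No network transport is shown.

```python
def package(problem, sizes):
    """Split a list into consecutive (unit_id, chunk) work units.

    Chunks follow the requested sizes; any remainder becomes a final unit.
    """
    units = []
    start = 0
    for size in sizes:
        if start >= len(problem):
            break
        units.append((len(units), problem[start:start + size]))
        start += size
    if start < len(problem):
        units.append((len(units), problem[start:]))
    return units


def process(unit):
    """What a remote machine would do: compute on one unit, tag the result."""
    unit_id, chunk = unit
    return unit_id, sum(x * x for x in chunk)


def collect(results):
    """Reassemble results by unit id, regardless of arrival order."""
    return [value for _unit_id, value in sorted(results)]


if __name__ == "__main__":
    units = package(list(range(10)), [2, 3])
    # Process in reverse to imitate results returning out of order.
    results = [process(u) for u in reversed(units)]
    print(collect(results))
```

Tagging each unit with an id is what lets the sub-problems be of any size and be returned in any order, since the collector never depends on timing.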
How Does HPC Work in Practice?
Given the spectrum of High Performance Computing architectures we have seen, questions naturally arise. How successful have they been, over the years? Is this world still changing? What can be expected in the future? These questions can be controversial, but some definite answers can be made.
HPC began as clever circuitry cleverly implemented in fast, expensive technology, and evolved over time into multiplicities of cheap and common elements, of increasing scope and scale. It is inarguable that major developments in science and engineering have been established as a result of this evolution in tools. Also, world-changing capabilities have been achieved – developments in the handling of large databases, for instance. These developments are clear to anyone associated with science, engineering, business, finance, and government, and their applicability and proven advantages are too many to enumerate.
However, it is also the case that improvements in other areas have muddied the waters somewhat. Single processor workstations are now much more powerful than departmental computers of even two decades ago. Decreases in semiconductor feature sizes allow far more transistors per chip and far higher clock rates; these facts have led some to ask, why go parallel? The simple answer is that there are still problems not amenable to solution by single processors or computers. HPC practice has allowed such problems to be solved, even as the improvement in component technology has multiplied the effects of parallelism. These processes continue, and new problems join the list of successes yearly.
But what can we expect in the future? The continuing improvement in HPC is not advancing at the pace some would like to see. Super-systems are difficult to use and manage in proportion to their size and complexity. Currently, many more problems could benefit from HPC solutions than can use them, and many of these problems involve the management and manipulation of extremely large data sets. Work is ongoing to address this with improved high performance file systems.
