Programs Use Data

In the HPC world, the norm is that programs read their inputs and then write their outputs more or less continuously until they terminate. Large-scale scientific modeling programs tend to produce large amounts of output. Some programs read large amounts of data, often produced by large experiments, process it in search of “interesting” events, and produce smaller, but still large, amounts of output. In short, big programs that run on big systems read and write big data files. This is the crux of the problem.

Large Output Files

Typical scientific HPC runs of simulation-modeling programs tend to read input data, in the first few seconds of execution, that input characterizes the problem being simulated. They then evolve a simulation of a physical system for a specified simulation time, periodically writing out the state of the simulated system at specific simulation times. Such outputs are typically designed to be easily usable as input for subsequent runs. The largest models now running on the largest systems in use can take days to reach termination conditions, on hundreds of thousands of processors. These large computer systems are subject to occasional component failures, so the programs that run on them regularly produce, in addition to simulation outputs, large amounts of “checkpoint” output, suitable for use in restarting the run, should something interrupt it.

Simulation output may be of several sorts. It may be data used to visualize the state of the modeled system, which is intended to overcome the overload of large ensembles of numbers. It is also useful for checking the validity of the model and the progress of the run in the desired direction. Finally, it can be used in tracing execution errors, caused by logical design errors in the simulation code. Debugging lengthy parallel computations is very difficult, and programmers often include details in their output to assist in this effort, though not relevant to the particular problem being solved.

Checkpoint data is usually the complete state of the computation on each node at an instant in its execution; this is more comprehensive than the state of the modeled system, and almost always larger, containing moment-by-moment bookkeeping details that allow the computation to be easily and immediately restarted from it. In comparison, while the simulation output may be transformed into an input file for a new run, the checkpoint can be used to restart the current run, for instance if a node failure requires remapping the run onto a new partition of the computer system. Checkpoint output is typically not produced as often as simulation output, and is stored differently from it; for instance, one checkpoint may overwrite the previous one, while simulation output tends to be cumulative.

Either sort of output may be very large. Consider a 10,000 processor run in which each processor maintains and evolves one billion state elements for the simulation. If simulation output is written once per minute over a ten hour run, the total output will be six quadrillion numbers, or in single-precision, twenty-four quadrillion bytes. If each processor writes each output into a separate file, the result will be 6 million files. If each node contains 16 GB of RAM, each checkpoint output will contain 160,000 GB of data

While economies doubtless can be brought to bear on these large numbers, the nature of the problem is evident. Runs of this magnitude are very expensive, being possible on only the largest and costliest systems. Therefore, the output is precious and must be protected. File systems of sufficient capacity, speed, and flexibility are needed, or the computation cannot progress. No single device of file system currently can handle the number of bytes or of files that may result. And no single device should be trusted with so much crucial data; it should be spread among multiple devices, for speed and capacity, and replicated, for safety

In HPC systems not used for scientific simulation, we often find that while the computations are more readily manageable, being based in units of transactions or database queries, the data sets they operate on are just as large and even more valuable. Google’s enormous database of information gathered from the World Wide Web constitutes a very large database that is read and written continuously by thousands of computers. Any financial firm is likely to have large databases of client information, even more important to the firm and its customers than Google’s WWW information. Large manufacturing firms (Boeing, Ford, Sony, etc.) must keep track of their business transactions and supply chains, design information, and production system status. The US Government maintains large databases for the census, public services, and presumably other purposes not publicly discussed.

These non-science-related databases tend to be far more distributed than scientific ones, in the sense that the storage location of a datum will not likely be “near” the processor that will next need to access or change it. These databases tend to be long- lived. Scientific data may be used and discarded, but a bank’s account records must be persistent, perhaps over long time intervals. Some databases are read more often then they are written, while others are continually modified, and this mandates different management strategies.

The key point is that there are many large scientific and non-scientific databases in the world, and they all constitute significant management problems from hardware to software, from front end terminals to back end storage devices.

Storage Devices and Systems

Not long ago, the storage medium of choice was magnetic tape. Its advantages of high density and low cost were balanced by its disadvantage of being easily damaged by transport malfunction, and its sequential access mode. However, for the largest mass storage systems, tape cartridges, mounted and accessed in robotic libraries, are still much favored over disk drives. Disks have the disadvantage of operating whether or not in immediate use, but the higher performance of more immediate access modes. Modern archival databases tend to use a combination of these devices.

A third option is Solid State Disks (SSDs). These are banks of semiconductor memory, slower than computer RAM, but non-volatile in the face of power loss. They tend to be used as if they were rotating disks, although lower in storage density; and, they are more expensive and considerably faster than disks. Traditional disk drives tend to be a good combination of high density, high speed, high reliability, and low cost, so they are more prevalent, at present. SSDs, however, are useful in specific parts of advanced file systems.

The physical characteristics of rotating disks impose special techniques their use. These are usually buried in device driver and library software, for the sake of performance, but they are worth knowing about. Disks are organized into blocks, tracks, and surfaces. Each physical magnetic disk has two surfaces that can contain data. They contain data written into concentric tracks placed small but safe radial distances apart, from near the center of the disk to near its edge. Each track is broken into blocks of contiguous bits, comprising the smallest addressable unit of storage. The blocks hold user data as well as management overhead data; the latter may include error correction values, which are functions of the user data, and location links to previous and subsequent blocks of a file. Additionally, the file system itself will be stored on the disk as a set of data structures that define and describe the file system’s contents.

An important characteristic of disk drive performance is “latency”, the time it takes to read or write the first bit. Latency is caused by the rotation of the disk and the positioning of the read-write head; the desired data block will take, on average, half a rotation to arrive under the read-write head, and the head may take as long to move to the correct concentric track in which the data block is located. Once this latency has passed, the next 10,000 or so bits will read at a rate determined by the data density and rotation speed of the disk drive, and the bandwidth of the interface connecting the drive to the computer. Typical commercial drives can sustain data rates in the billions of bits per second, and rotate up to 10,000 times per minute. This means, for example, that reading a one million byte file could take as long as 0.01 seconds to start but only another 0.01 seconds to complete.

The most severe limitation on disk performance is the number or input-output operations per second (IOPS) that can be performed. This is generally relatively low – often under 100 per second – and can have severe effects where small reads and writes are common. However, even for large file accesses, it tends to be the most intractable limit, keeping disk performance well below what might be inferred from bandwidth capabilities. This is a justification for SSDs: if they can be used for small-scale I/O and storage, such as for indices and directories, as opposed to large user files, the advantage of zero latency becomes significant.

Writing such a file, typically takes more time than reading it, because the file system must be queried to find a place to put it, and to annotate its location when writing is finished. File systems are themselves programs containing complex data structures that manage and maintain the state of the storage devices, and which are also stored on special reserved portions of the disks. A file system may index and link the file and directory structures that are stored on the disk, and at least one other index of free versus occupied blocks on the disk. These file system data structures are often termed “metadata”, and they become more complex wide-ranging in larger and more complex HPC file systems. It is common for a file to be stored in multiple non-contiguous blocks that scattered across multiple tracks and surfaces of a disk drive. However, for efficiency, disk space is sometimes left unused by the file system, so that files can be stored as contiguously as possible. For some applications, this efficiency is a necessity.

In a conventional computer, such as the one being used read this document, disk access can occur continuously. Portions of the operating system, applications, and user data are constantly being read and written, and there is competition for service among entities that use the file system. That competition must be addressed by clever algorithms, policies, and mechanisms, in order to manage storage resources efficiently.

Programs that need data usually must wait for disk reads to finish, but programs that produce data it may be able to continue execution while disk writes take place. This can lead to loss of data if a write failure occurs, and the program has continued and overwritten that originating data structures. File systems may seek efficiency by prioritizing reads and writes, along with movement of the read-write heads. They may also manage and assign different sized portions of storage space, depending on circumstance. If reads conflict, individual files can be “locked” or reserved for use by a single entity or process, and in more complex circumstances, locking and space assignment may be taken to larger or smaller extents, from individual blocks of a file to entire surfaces of a drive. Finally, the ability of a programmer to predict program needs for I/O and file space may be used to advise the file system, and thereby improve performance.

Even on conventional computers, multiple physical devices may be used, and in different ways. File systems routinely manage multiple physical or logical devices. The distinction here, is on what something is versus what it appears to be; a physical drive is just that, whereas a logical drive (or “volume”) is the view of a set of devices that a file system presents to the user. Logical devices may be made of multiple physical drives, either concatenated together or used in parallel; this allows several drives to become a single volume, possibly of greater capacity than a single drive. Alternatively, a single physical device may be divided (“partitioned”) into multiple logical “volumes”; for instance, users could be given separate partitions, logically separating their storage spaces.

A common way of using multiple drives in concert is called a RAID (Redundant Array of Inexpensive Drives) file system. In a RAID system, multiple drives can be used in a number of different configurations, ranging from improving I/O speed by splitting data across several drives, to improving fault tolerance by replicating data across several drives. There are a number of common RAID configurations, with the most complex offering both redundancy and speed advantages.

More complex use of multiple volumes may allow further performance gains. One drive can be read while another is written, simultaneously. I/O streams may be spread across multiple devices, as in a parallel computation where each computer node accesses its own drive subset. There may also be separate drives or nodes used to serve the file system metadata, so requests for location and status of portions of storage can be satisfied independent of the reads and writes of actual data.

Conventional Computing Systems and File Systems versus HPC Systems

Conventional single-user computers do not usually have or need high performance file systems. Access to storage by multiple executing programs is infrequent and generally of low order; the OS may need swap space or file buffers, while an application is handling document I/O, but this sort of competition is usually easy to prioritize and satisfy. Similarly, file space needs are usually relatively straightforward, and satisfiable without much management overhead. Thus, conventional file systems are usually provided as components of operating systems. This is not to say that such file systems are simple or primitive; years of effort and study have gone into making them fast, clean, and reliable. But they are relatively well understood, and fairly uniform in their characteristics and capabilities.

The situation with HPC systems and their file systems is far different. Multi-processor computers are likely to have multiple-drive disk subsystems. Some will have drives assigned and attached to specific nodes and directly accessible without communication across the interconnect. The elements of the disk subsystem may be dynamically configurable, with individual drive subsets assignable to different duties on a per-node basis: read-only, write-only, or read-write; alternating-read-write, temporary-reusable, or permanent.

Dedicated file system and metadata servers can act as intermediaries between the nodes executing the program and the drives holding the data. SSDs offer performance advantages in metadata servers due to relatively high rates of input-output operations.

In very large systems, the storage subsystems can become so large that device failure is a real problem. Even when data is replicated across drives, a failure during a write operation can cause loss of data if the RAM buffer that held it before transfer to the storage subsystem is reused before the disk writes are completed. The most advanaced HPC file systems allow component drives to be safely replaced or upgraded in place with only transient loss of performance, while safeguarding data integrity.

Careful design and delicate tuning are necessary for efficient use of the largest and highest performance systems on the largest and most demanding problems.