Many file systems have features that characterize the way they work in general. We’ll look briefly at the Unix File System and ZFS as examples of traditional and modern technologies, and then at two more complex file systems, GPFS and Lustre, as examples of HPC file systems. This is a conceptual survey: not exhaustive, but an introduction to the concepts and terminology.
The Unix File System
The Unix File System (UFS) is also called the Fast File System (FFS). It maintains user data and bookkeeping information, organized mostly as “inodes” and “data blocks”. The inodes are sequentially numbered records that hold file metadata for a volume, which may span a whole or partial disk drive. Directories are stored the same way, and hold lists of file names and their inode numbers. This scheme has been augmented in various ways to improve performance, efficiency, and reliability. A traditional weakness (now largely overcome) was that the number of inodes on a volume was fixed when the device was incorporated into the file system, and changing it was a destructive operation. With modern enhancements, differing by vendor and platform, UFS continues to be a well understood and commonly used file system.
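The relationship between directories, inode numbers, and data blocks can be sketched as a toy in-memory model. This is an illustration only, not the UFS on-disk format: the names `Inode`, `ToyFS`, `create`, `lookup`, and `read` are invented for this example, and the fixed-size inode table is an assumption chosen to mirror the traditional limitation described above.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Per-file metadata. Note that the file's name is not stored here."""
    number: int
    is_dir: bool = False
    size: int = 0
    blocks: list = field(default_factory=list)  # indices into data_blocks


class ToyFS:
    def __init__(self, max_inodes):
        # The inode table is sized once, when the file system is created.
        self.max_inodes = max_inodes
        self.inodes = {}       # inode number -> Inode
        self.data_blocks = []  # user data, one bytes object per block
        self.dirs = {}         # directory inode number -> {name: inode number}
        self.root = self._alloc(is_dir=True)

    def _alloc(self, is_dir=False):
        if len(self.inodes) >= self.max_inodes:
            raise OSError("out of inodes")
        ino = Inode(number=len(self.inodes), is_dir=is_dir)
        self.inodes[ino.number] = ino
        if is_dir:
            self.dirs[ino.number] = {}
        return ino

    def create(self, parent, name, payload):
        ino = self._alloc()
        self.data_blocks.append(payload)
        ino.blocks.append(len(self.data_blocks) - 1)
        ino.size = len(payload)
        # The directory entry is what ties a name to an inode number.
        self.dirs[parent.number][name] = ino.number
        return ino

    def lookup(self, parent, name):
        return self.inodes[self.dirs[parent.number][name]]

    def read(self, ino):
        return b"".join(self.data_blocks[i] for i in ino.blocks)
```

In this sketch, creating a file after the inode table is full fails even if data blocks are still free, which is the kind of situation the fixed inode count could produce.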
The ZFS File System
ZFS, originally called the Zettabyte File System, offers expanded capabilities that overcome some of the shortcomings of older file systems. ZFS is an open source system built on virtual storage pools (“zpools”), which are constructed of virtual devices (“vdevs”); vdevs, in turn, are constructed of physical drives, RAID subsystems, or drive partitions. ZFS is a “128 bit” file system, which means 128 bits is the largest address size for any unit within it. This size allows capacities not likely to become confining anytime in the foreseeable future. For instance, the theoretical limits it imposes include 2^48 entries per directory, a maximum file size of 16 EiB (2^64 bytes, or 16 * 2^60 bytes), and a maximum of 2^64 devices per “zpool”.
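To put these limits in more familiar units, the arithmetic can be checked in a few lines. This is plain integer arithmetic on the figures quoted above; the variable names are chosen for this example only.

```python
EIB = 2 ** 60  # one binary exabyte (exbibyte), in bytes

max_file_size = 2 ** 64         # bytes
max_dir_entries = 2 ** 48
max_devices_per_pool = 2 ** 64

print(max_file_size // EIB)     # maximum file size in binary exabytes
print(f"{max_file_size:.3e}")   # the same size as a decimal byte count
print(f"{max_dir_entries:,}")   # directory entries, written out in full
```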
ZFS allows drives to act as hot spares, which can be swapped in while the file system is in use. Read and write caching is supported, as is mirroring. Zpools can contain heterogeneous collections of devices, and can be expanded at any time. Quotas can be imposed on the space that file systems consume, and space can be reserved in advance of need.
An interesting and novel feature of ZFS is that it adheres to a “copy-on-write” transaction model: blocks of active data are, by default, never overwritten; instead, new blocks for data and metadata are allocated and written before the transaction is “committed”. Afterward, both the old and new states of the file system can remain available: where the old blocks are retained, the old file system state can be recovered quickly and easily. An obvious disadvantage is that stale storage is not reclaimed for reuse as long as an older state still references it, but large disk drives are cheap, and file systems already contain significant redundancy, trading efficient space usage for reliability and speed. File system “snapshots” follow naturally from this model, and entire file systems are thus easily cloned. Similarly, creating a new file system within a pool is quick and easy.
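The copy-on-write idea can be illustrated with a small in-memory sketch. It is not how ZFS is implemented: the class `CowStore` and its methods are hypothetical, the "root" here is a plain dictionary standing in for the metadata tree, and a commit is modeled as a single reference assignment.

```python
class CowStore:
    def __init__(self):
        self.blocks = {}     # block id -> bytes; entries are never overwritten
        self.next_id = 0
        self.root = {}       # live state: file name -> block id
        self.snapshots = {}  # snapshot label -> an earlier root

    def _write_block(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        return bid

    def write(self, name, data):
        new_bid = self._write_block(data)  # 1. write the new data block
        new_root = dict(self.root)         # 2. build new metadata beside the old
        new_root[name] = new_bid
        self.root = new_root               # 3. commit by switching the root

    def snapshot(self, label):
        # Roots are replaced, never mutated, so keeping a reference is enough.
        self.snapshots[label] = self.root

    def read(self, name, snapshot=None):
        root = self.root if snapshot is None else self.snapshots[snapshot]
        return self.blocks[root[name]]
```

Because a write never touches existing blocks, a snapshot costs only one retained reference, and the block holding the earlier contents stays allocated for as long as that snapshot exists.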
File update transactions can be organized into groups, for efficiency, and data blocks are organized into tree structures. Device striping is dynamic, as are pool sizes; when new devices are added, they are automatically integrated; in the case of failures, the file system automatically “heals” when a new device is swapped in. Checksumming is heavily used within storage pools.
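How checksums make automatic "healing" possible can be sketched with a mirrored block. This is a simplified illustration under stated assumptions: SHA-256 from Python's `hashlib` stands in for whatever checksum a pool is configured to use, and `MirroredBlock` is a hypothetical class, not a ZFS structure.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    def __init__(self, data, copies=2):
        # Each copy stands in for the same block on a different device.
        self.copies = [bytearray(data) for _ in range(copies)]
        # The expected checksum is kept apart from the data it protects.
        self.expected = checksum(data)

    def read(self):
        good = None
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                break
        if good is None:
            raise IOError("all copies failed verification")
        # Repair any copy that does not match the verified one.
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.expected:
                self.copies[i] = bytearray(good)
        return good
```

A read in this sketch returns verified data and, as a side effect, rewrites any damaged copy from a good one.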
Performance-based options include explicit I/O prioritization and scheduling (with deadlines), transparent compression, load and space sharing within pools, configurable data and metadata replication, and optional write caching.
ZFS is available for a number of platforms and systems, and development is ongoing.
Parallel File Systems
Parallel HPC systems impose special demands on their file systems. The bottlenecking effect of disks operating several orders of magnitude slower than the processors that access them is multiplied under parallel execution. The quantities and sizes of files read and written by HPC systems tend to be much larger than conventional file systems can handle. Parallel file systems can be considered supersets of distributed file systems, and the orchestration and management of parallelism in a file system is just as significant for performance as it is in parallel programs in general.
The GPFS File System
IBM’s General Parallel File System (GPFS) is a high performance, disk sharing, clustered file system that is usable with AIX clusters, Linux clusters, and Windows nodes. It includes extensive management tools, and allows sharing file systems among remote GPFS clusters.
GPFS provides high I/O performance by striping individual files across multiple disks in parallel transfers. File system nodes contain RAID controllers, and storage pools allow disk grouping within a file system. Distributed metadata storage includes directory tree structures; applications do not need to know where their data is stored. Distributed locking is possible, and full POSIX file system semantics are implemented.
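Striping a file across disks amounts to a mapping from a file offset to a disk and an offset on that disk. The sketch below assumes simple round-robin placement with a fixed stripe size; that is an assumption for illustration, not a description of the GPFS allocation policy, and the function names are invented here.

```python
def locate(offset, stripe_size, num_disks):
    """Map a byte offset in a file to (disk index, byte offset on that disk)."""
    stripe_index = offset // stripe_size
    disk = stripe_index % num_disks
    disk_offset = (stripe_index // num_disks) * stripe_size + offset % stripe_size
    return disk, disk_offset


def split_request(offset, length, stripe_size, num_disks):
    """Break one file request into per-disk pieces (disk, disk offset, length)."""
    pieces = []
    end = offset + length
    while offset < end:
        # Stop at the end of the current stripe or of the request.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        disk, disk_offset = locate(offset, stripe_size, num_disks)
        pieces.append((disk, disk_offset, chunk))
        offset += chunk
    return pieces
```

With a stripe size of 4 bytes and 3 disks, an 8-byte request starting at offset 2 is split into three pieces, one per disk, which can then be transferred in parallel.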
A noteworthy characteristic of GPFS is its graceful degradation under network failures that partition the system. File system maintenance can be performed while the system is in use. Administration and configuration are flexible, and the file system namespace can be partitioned.
The Lustre File System
Lustre is a parallel distributed file system capable of scaling up to very large cluster systems. It is distributed under the GNU General Public License as open source software. In it, metadata and user data are stored on and served by separate device sets; these are termed MetaData Server (MDS) and MetaData Target (MDT), and Object Storage Server (OSS) and Object Storage Target (OST), respectively, with the distinction between “server” and “target” being similar to that between logical and physical device.
The total Lustre File System capacity is the sum of the capacities of its Targets. The MDT and OST can be on the same node, but usually are not, with two to four OSTs per OSS node.
There is at least one MetaData Target (MDT) per file system, storing the file layout for the MetaData Server (MDS), in addition to user file and directory information. The MDT is a dedicated file system controlling file access and informing clients which objects make up a file.
There is also at least one Object Storage Server (OSS) per file system, storing user file data on at least one Object Storage Target (OST); however, one OSS can typically serve from two to eight OSTs, each one a local file system of up to 8 TB. The OST is a dedicated file system providing an interface to byte ranges of objects for all access operations.
Clients see only a single unified file system, and access user data via standard POSIX semantics. When clients address an MDS, a file is created or a file layout is returned. In reads and writes, the file layout is mapped to one or more objects on an OSS.
Concurrent read and write access to the stored files is possible. After the requested object is locked, parallel accesses are performed directly on the OSTs. Modifications to stored objects take place via delegation to the OSSs, allowing scalability and enhanced security and reliability.
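The division of labor described above, where the metadata service hands out a file layout and the client then moves data directly to and from the object targets, can be sketched as follows. This is a conceptual toy, not the Lustre protocol or API: the classes `OST` and `MDS`, the functions `client_write` and `client_read`, and the round-robin layout are all assumptions made for this example. Locking is not modeled, and each file is assumed to be written once.

```python
class OST:
    def __init__(self):
        self.objects = {}  # object id -> bytearray of file data


class MDS:
    def __init__(self, osts, stripe_count):
        self.osts = osts
        self.stripe_count = stripe_count
        self.layouts = {}  # path -> list of (OST index, object id)
        self.next_obj = 0

    def open(self, path, create=False):
        """Return the file layout, creating the file and its objects if asked."""
        if path not in self.layouts:
            if not create:
                raise FileNotFoundError(path)
            layout = []
            for i in range(self.stripe_count):
                ost_idx = i % len(self.osts)
                self.osts[ost_idx].objects[self.next_obj] = bytearray()
                layout.append((ost_idx, self.next_obj))
                self.next_obj += 1
            self.layouts[path] = layout
        return self.layouts[path]


def client_write(mds, osts, path, data, stripe_size):
    layout = mds.open(path, create=True)  # metadata step
    for n, start in enumerate(range(0, len(data), stripe_size)):
        ost_idx, obj_id = layout[n % len(layout)]
        # Data step: goes straight to the object, bypassing the MDS.
        osts[ost_idx].objects[obj_id] += data[start:start + stripe_size]


def client_read(mds, osts, path, stripe_size):
    layout = mds.open(path)
    objs = [bytes(osts[i].objects[o]) for i, o in layout]
    out = bytearray()
    n = 0
    while True:
        pos = (n // len(objs)) * stripe_size
        chunk = objs[n % len(objs)][pos:pos + stripe_size]
        if not chunk:
            break
        out += chunk
        n += 1
    return bytes(out)
```

The metadata service in this sketch never sees file contents; it only records which objects make up a file, which is what lets the data path scale with the number of object targets.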
Lustre supports numerous types of communication networks, and is capable of Remote Direct Memory Access (RDMA). Robust failover and recovery are supported, as are live upgrades. Plans exist to support ZFS on the individual devices.
The Bottom Line
The characteristics of file systems for conventional computers are well understood, and codified within POSIX standards. Even so, there are significant differences among them, and they continue to evolve. Parallel file systems, unsurprisingly, are far more complex, and their features are evolving to address future demands. Among the most important and complex features of HPC systems is fault tolerance, for the most powerful system is useless if its results are untrustworthy. File system integrity is as important as computational correctness, for result data must be stored before it can be studied.
