Forget beards and sandals. The leading open-source file systems are enterprise-class big hitters, with advanced features and massive, often parallel, scalability. Sure, many are under development and marked as unstable, but a small but significant number of open source file systems are torture-tested, running in production environments, and offer attractive benefits.
ZFS and GlusterFS lead the way
ZFS was developed by Sun Microsystems, which launched it in 2004 as a component of Solaris. It has since been released as part of OpenSolaris, but is also available as a standalone package.
The main features of this 128-bit open source file system are its ability to manage very large file systems, block-level data deduplication, and its protection against bit rot (also known as silent corruption) through near-constant disk checksumming and automatic repair. It also performs copy-on-write (COW): data is written to a new disk block before the pointers are changed and the write is committed. This means the file system is always consistent, and it enables other features such as snapshots of live data.
ZFS also provides volume cloning and management, which includes the ability to grow and shrink volumes. It can span and stripe across multiple volumes using RAID.
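The COW-plus-checksum mechanism described above can be sketched in a few lines. This is a hypothetical in-memory model (the CowStore class and its dict-backed block map are invented for illustration); real ZFS works on disk blocks with a Merkle tree of checksums, but the principle is the same: never overwrite live data, and verify every read.

```python
import hashlib

class CowStore:
    """Toy copy-on-write block store: writes never overwrite live data."""
    def __init__(self):
        self.blocks = {}          # block id -> bytes
        self.checksums = {}       # block id -> sha256 hex digest
        self.pointer = None       # the "live" block
        self.next_id = 0

    def write(self, data: bytes) -> int:
        # Copy-on-write: allocate a NEW block, checksum it, then flip
        # the pointer. The old block stays intact until the flip commits.
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.checksums[bid] = hashlib.sha256(data).hexdigest()
        old = self.pointer
        self.pointer = bid        # the atomic commit point
        return old                # the previous version survives -> snapshots

    def read(self) -> bytes:
        data = self.blocks[self.pointer]
        # Verify on every read to catch silent corruption.
        if hashlib.sha256(data).hexdigest() != self.checksums[self.pointer]:
            raise IOError("bit rot detected")
        return data

store = CowStore()
store.write(b"version 1")
old = store.write(b"version 2")
assert store.read() == b"version 2"
assert store.blocks[old] == b"version 1"   # old version still intact

# Simulate bit rot: flip the data behind the file system's back.
store.blocks[store.pointer] = b"version X"
try:
    store.read()
    corrupted = False
except IOError:
    corrupted = True
assert corrupted
```

Because the old block is never overwritten, a snapshot is simply a retained pointer to it, which is why ZFS snapshots of live data are cheap.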
ZFS is the basis for a number of storage OSes, such as FreeNAS and NexentaStor, and can also be installed into Linux. ZFS is not a parallel file system, however, as it depends on NFS for its base file system capability. That will change when pNFS is ready, which is likely in the next year or so.
Red Hat's GlusterFS is an open source file system with parallel capability that scales to a claimed 27 brontobytes (a brontobyte is 2^90, or roughly 1.24 × 10^27, bytes). It is aimed at those, such as cloud providers, who need a scalable, POSIX-compliant file system in the data centre.
It connects multiple servers and clients to its virtual storage pool using its native protocol, CIFS or NFS, and provides a single global namespace, with files and metadata distributed across multiple systems. Features include distributed, replicated and/or striped volumes, high availability, and the ability to store files on a number of underlying file systems, including the Linux standards ext3/ext4.
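The idea behind Gluster's global namespace is that every client can compute where a file lives rather than asking a central metadata server. The sketch below illustrates that hash-based placement; the server names and the md5-based mapping are assumptions for illustration, not Gluster's actual elastic-hashing algorithm:

```python
import hashlib

SERVERS = ["server1", "server2", "server3"]

def locate(path: str) -> str:
    # Hash the file name and map it deterministically to a server;
    # no central metadata lookup is needed.
    digest = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# Every client computes the same placement independently,
# so the namespace stays global without a lookup service.
assert locate("/reports/q1.pdf") == locate("/reports/q1.pdf")
placement = {p: locate(p) for p in ("/a", "/b", "/c", "/d", "/e")}
assert set(placement.values()) <= set(SERVERS)
```

Removing the metadata lookup from the data path is what lets this style of design scale out: adding servers adds capacity without creating a central bottleneck.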
Hadoop and others
Apache.org's Hadoop is modelled on the distributed computing techniques Google published to describe how its international data centres deliver search results quickly. Aimed at big data applications, it is designed to scale from single servers to thousands of machines, each offering local computation and storage.
Hadoop's primary file storage mechanism is HDFS, which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster. As an alternative, Hadoop can use GlusterFS, which Red Hat says enables simultaneous file-based and object-based access within Hadoop, eliminating the central metadata server and adding fault tolerance.
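HDFS's approach of replicating each data block across several compute nodes can be illustrated with a toy placement scheme. The round-robin function below is a simplified assumption for the sake of the sketch; real HDFS placement is rack-aware and more sophisticated:

```python
def place_replicas(block_id: int, nodes: list, replication: int = 3) -> list:
    """Assign `replication` distinct nodes to hold copies of a block.

    Toy round-robin placement; real HDFS spreads replicas across
    racks to survive rack-level failures.
    """
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node%d" % i for i in range(5)]
replicas = place_replicas(7, nodes)
assert len(replicas) == 3
assert len(set(replicas)) == 3   # three distinct nodes hold the block

# Losing any one node still leaves two live copies of every block.
survivors = [n for n in replicas if n != replicas[0]]
assert len(survivors) == 2
```

The replication factor is the trade-off dial: more copies cost more disk but let the cluster lose more machines before data becomes unavailable.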
Another open source file system is BTRFS, a Linux-based COW file system featuring fault tolerance, space efficiency, snapshots and easy administration. With a feature set similar to that of ZFS, it includes compression, defragmentation, checksumming and RAID, along with incremental backups and volume scrubbing, and it can convert files in place from existing ext3/4 systems. Many features remain under development, however, and are considered unstable.
Ext3cow is based on ext3 and supports versioning through copy-on-write, so you can access previous versions of files. The developers claim performance equivalent to ext3's, but there is no way of converting from ext3 to ext3cow other than copying from one medium to another.
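Versioning through copy-on-write means each write is recorded against an epoch, and a read can ask for the file as it stood at any past epoch. The model below is a minimal sketch of that idea; the VersionedFile class and its API are invented for illustration and are not ext3cow's actual interface:

```python
class VersionedFile:
    """Toy epoch-based versioning in the spirit of ext3cow."""
    def __init__(self):
        self.versions = []        # list of (epoch, contents), append-only

    def write(self, contents: bytes, epoch: int):
        # Copy-on-write: old versions are never modified, only appended to.
        self.versions.append((epoch, contents))

    def read(self, epoch: int = None) -> bytes:
        if epoch is None:
            return self.versions[-1][1]
        # The latest version written at or before the requested epoch.
        past = [c for e, c in self.versions if e <= epoch]
        return past[-1]

f = VersionedFile()
f.write(b"draft", epoch=100)
f.write(b"final", epoch=200)
assert f.read() == b"final"
assert f.read(epoch=150) == b"draft"   # the past remains accessible
```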
OpenAFS is a distributed file system that runs on Windows and Macs. It was donated to the community by IBM and enables people in multiple locations to collaborate on the same files. While the files can reside anywhere in the distributed system, they appear to users as local, with AFS locating the correct file automatically. It uses the concept of cells: an administrative unit that allows storage to be tailored to a group of users without having to consult other cell administrators. For example, you can determine the number of clients and servers, file locations, and how to allocate client machines to users.
XFS is a highly scalable journaling file system developed by Silicon Graphics (SGI) which can scale to billions of files. It works with SGI's IRIX from version 5.3 onwards and has been ported to various Linux distributions. It uses journaling to maintain file system consistency: changes are written to a log before being applied. Because metadata journal writes can contend with actual data for the disk, the journal can optionally be placed on a separate physical device or held entirely in cache memory. Snapshots can be taken using an external volume manager, and backup and restore utilities exist within XFS.
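The journaling sequence — log the intent, apply the change, retire the journal entry — can be sketched as follows. JournaledStore is a toy in-memory model invented for illustration, not XFS's on-disk log format, but it shows why a crash mid-update is recoverable:

```python
class JournaledStore:
    """Toy write-ahead metadata journal."""
    def __init__(self):
        self.journal = []     # write-ahead log (on XFS, optionally a separate device)
        self.metadata = {}

    def update(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))    # 1. log the intent first
        if crash_before_apply:
            return                           # simulated crash
        self.metadata[key] = value           # 2. apply the change
        self.journal.pop()                   # 3. retire the journal entry

    def recover(self):
        # After a crash, replay anything still sitting in the journal.
        while self.journal:
            key, value = self.journal.pop(0)
            self.metadata[key] = value

fs = JournaledStore()
fs.update("inode42.size", 1024)
fs.update("inode42.size", 2048, crash_before_apply=True)
assert fs.metadata["inode42.size"] == 1024   # crash left the old value
fs.recover()
assert fs.metadata["inode42.size"] == 2048   # replay completed the update
```

Since every change passes through the log before the main structures, putting that log on a separate fast device moves the contention off the data disks, which is exactly the option XFS offers.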
XtreemFS is also a parallel file system that replicates file data across multiple storage servers and includes a replication algorithm designed to cope with a range of failure scenarios including message loss, network partitioning, and server crash. When a replicated file is opened, XtreemFS selects one replica as a primary, and this then processes all file operations. If the primary fails, one of the backup replicas will automatically take over after a short fail-over period. As well as Linux, XtreemFS packages are available for Windows and Mac.
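The primary/backup scheme described above can be modelled in a few lines. ReplicaGroup and its ordering rule are illustrative assumptions, not XtreemFS's actual lease-based election protocol, but they capture the behaviour: one replica serves all operations, and a backup takes over when it fails:

```python
class ReplicaGroup:
    """Toy primary/backup replication: first live replica is primary."""
    def __init__(self, replicas):
        self.replicas = list(replicas)   # fixed priority order
        self.alive = set(replicas)

    def primary(self):
        # The first replica in priority order that is still alive
        # becomes (or remains) the primary.
        for r in self.replicas:
            if r in self.alive:
                return r
        raise RuntimeError("no replica available")

    def fail(self, replica):
        self.alive.discard(replica)

group = ReplicaGroup(["osd1", "osd2", "osd3"])
assert group.primary() == "osd1"
group.fail("osd1")                 # the primary crashes...
assert group.primary() == "osd2"   # ...and a backup takes over
```

In the real system the fail-over pause exists because the backups must be sure the old primary is gone (and its lease expired) before one of them can safely serve writes.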
Many of these systems, when run under Linux, FreeBSD, OpenSolaris or Mac OS X, operate through FUSE (Filesystem in Userspace), a loadable kernel module that lets file systems run as ordinary user-space programs without altering the kernel. FUSE makes different file systems relatively easy to install, means a file system bug is less likely to bring down the whole system, and allows faster fixes because updates and bug fixes do not have to be compiled into the kernel.
One issue that may concern users is that ZFS is available under Sun's (now Oracle's) Common Development and Distribution License (CDDL) not GPL, the more common open source licence. Incompatibility of the two licensing models means that it could prove impossible to combine two pieces of code with different licence requirements and satisfy the terms of both.
Oxford Archaeology to deploy BTRFS
Chris Puttick, strategic consultant for Oxford Archaeology, plans to convert the organisation's 90 TB Linux-based backup server from ext4 to BTRFS. The system is used for backing up client machines, and so integrity and features rather than performance are the priority.
"We looked for a cleverer file system," Puttick said. "We looked at ZFS but the Sun open source licence, CDDL, makes it hard to bring into systems so we pulled away. We also looked at XFS but are now betting the future on BTRFS."
For Puttick, the key features are bit-level integrity, support for very large file systems, easy conversion from ext3/ext4 file systems, snapshots, checksums of individual files, incremental backup, and online file system checksumming.
"The server needs a lot of storage because you need multiple copies over many weeks. It's now running ext4 but will convert easily to BTRFS," he said. "It uses free space in the existing file system and uses that to write additional information."
Puttick said that, although BTRFS is still under development, he intends to use only those features that are considered stable. "The features that are stable are what we need; for very large file stores where you cannot afford to lose a file, having a file system that checks file integrity is powerful."
This was first published in August 2012