Sun ZFS 101

Learn the basics of Sun's ZFS file system in this tutorial.

Sun ZFS is not just a new file system -- it is a fundamentally new approach to data management. ZFS is the brainchild of Jeff Bonwick, Sun Microsystems' Chief Technical Officer of Storage Technologies, who spent years working on the Solaris virtual memory system and then set out to apply virtual memory concepts to storage.

When you add a DIMM to a system, you don't partition it, you don't allocate it, and when it is replaced you don't fsck it. Memory management is something we all take for granted because complex software hides the details from us. ZFS was born with the intent of bringing the same advantages to storage.

At first glance, the most striking feature of ZFS is that it combines the volume manager (which virtualizes disks, typically via RAID) and the file system into a single piece of software.

Disks are formed into a storage "pool" using a single simple command:

 $ zpool create mypool c4t0d0 c10t0d0 c11t0d0
 $ zpool list
 NAME     SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
 mypool   2.77G  144K   2.77G   0%   ONLINE  -
 $ zpool status
   pool: mypool
  state: ONLINE
  scrub: none requested
 config:

         NAME       STATE   READ WRITE CKSUM
         mypool     ONLINE     0     0     0
           c4t0d0   ONLINE     0     0     0
           c10t0d0  ONLINE     0     0     0
           c11t0d0  ONLINE     0     0     0

 errors: No known data errors
 $ zfs list
 NAME     USED  AVAIL  REFER  MOUNTPOINT
 mypool  73.5K  2.72G    18K  /mypool
 $ df -h | grep mypool
 mypool   2.8G    18K   2.8G   1%  /mypool

Using the simple command "zpool create" we specify a name for our pool ("mypool") and then the drives we want to assign to it -- in this case, three. This creates a "dynamic stripe," akin to RAID0 but with the added bonus that the stripe width is not set in stone.

The commands that follow in the example explore our new creation.

* "zpool list" will give you a quick summary of available pools
* "zpool status" will output information about the pool configuration, status and errors
* "zfs list" will show us all our available datasets
* "df -h" is our old and trusty friend that displays information about mounted file systems

This is the fantastic part. Notice carefully what has happened and what we didn't need to do! The disks were partitioned, the RAID was set up and made ready, the file system was created and it was mounted.

Using a traditional Unix logical volume manager, we would need to partition the disks, create physical volumes, combine them into a volume group, create a logical volume, create a file system, create a mount point and finally mount the volume.
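For comparison, here is roughly what the same task looks like with Linux LVM. This is only a sketch; the device names, volume names and ext3 choice are all hypothetical:

```shell
# Illustrative LVM equivalent of a single "zpool create" (hypothetical devices)
pvcreate /dev/sdb /dev/sdc /dev/sdd        # initialize each disk as a physical volume
vgcreate myvg /dev/sdb /dev/sdc /dev/sdd   # combine them into a volume group
lvcreate -n myvol -l 100%FREE myvg         # carve out a logical volume
mkfs -t ext3 /dev/myvg/myvol               # create a file system on it
mkdir /mymount                             # create a mount point
mount /dev/myvg/myvol /mymount             # and finally mount it
```

Six commands (and an /etc/fstab entry we haven't even written yet) versus one.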

Furthermore, Veritas Volume Manager or Logical Volume Manager commands are long, complex and frustrating!

With ZFS this was one simple and easy-to-understand command, without any of the hard work. If you don't appreciate the full gravity of how remarkable this is, spend some time setting up LVM or VxVM and then come back.

Drawing from the memory paradigm, now that the pool is in place we won't touch it again unless we want to add or replace disks in the pool. Really and truly, that is it.
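When the day does come to grow or service the pool, it is still one command per task. A sketch, with hypothetical device names:

```shell
zpool add mypool c12t0d0              # stripe another disk into the pool; space appears immediately
zpool replace mypool c4t0d0 c13t0d0   # swap out a failing disk for a new one
zpool status mypool                   # watch the resilver progress
```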

File systems: Data sets and properties
ZFS changes the way we think of a file system. Traditionally we are limited to one file system per volume. And why shouldn't we be? But consider that individual file systems may need different mount options or mount points. The way to solve this in the past was to create smaller volumes with a separate file system for each purpose, but given the complexity of managing this you can only get so granular. There is also considerable disk-consumption overhead for each file system.

In ZFS we instead think of a pool containing multiple data sets. A data set is a generic term that is, for all intents and purposes, just like what you consider a file system to be. A data set can have any mount point you wish, and can enable or disable certain mount options, such as turning "atime" on or off, setting read-only, etc. The difference is that data sets are extremely lightweight, and mount options are replaced with "data set properties."

Furthermore, data sets can be nested to form management hierarchies. In this way, data sets become points of administrative control for assigning quotas, mount points, compression, etc. Essentially you can create hundreds or thousands of "file systems" on a single system, perhaps one for each user's home directory. The more data sets, the more control you have.

Let us create some data sets using the "zfs create" command and change the mount point for them using the ZFS "mountpoint" property.

 $ zfs create mypool/home
 $ zfs create mypool/home/user001
 $ zfs create mypool/home/user002
 $ zfs create mypool/home/user003
 $ zfs set mountpoint=/myhome mypool/home
 $ zfs list
 NAME                  USED  AVAIL  REFER  MOUNTPOINT
 mypool                186K  2.72G    19K  /mypool
 mypool/home            76K  2.72G    22K  /myhome
 mypool/home/user001    18K  2.72G    18K  /myhome/user001
 mypool/home/user002    18K  2.72G    18K  /myhome/user002
 mypool/home/user003    18K  2.72G    18K  /myhome/user003

Notice that we've created nested data sets, and when we changed the mount point for "mypool/home" it also trickled down to all its children. This recursive behavior is known as "inheritance." We could easily override it for any one of the children, but when dealing with large numbers of data sets this behavior makes life much simpler.
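Overriding that inheritance for a single child is itself just one property setting, and reverting is equally simple. A sketch using the data sets above (the "/export" path is hypothetical):

```shell
zfs set mountpoint=/export/user003 mypool/home/user003   # override the inherited mount point for one child
zfs inherit mountpoint mypool/home/user003               # revert to the value inherited from mypool/home
```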

Now let us uncover some of ZFS's deep magic by looking at ZFS data set properties on one of these data sets:

 $ zfs get all mypool/home
 NAME         PROPERTY         VALUE                  SOURCE
 mypool/home  type             filesystem             -
 mypool/home  creation         Wed Dec 31 14:01 2008  -
 mypool/home  used             76K                    -
 mypool/home  available        2.72G                  -
 mypool/home  referenced       22K                    -
 mypool/home  compressratio    1.00x                  -
 mypool/home  mounted          yes                    -
 mypool/home  quota            none                   default
 mypool/home  reservation      none                   default
 mypool/home  recordsize       128K                   default
 mypool/home  mountpoint       /myhome                local
 mypool/home  sharenfs         off                    default
 mypool/home  checksum         on                     default
 mypool/home  compression      off                    default
 mypool/home  atime            on                     default
 mypool/home  devices          on                     default
 mypool/home  exec             on                     default
 mypool/home  setuid           on                     default
 mypool/home  readonly         off                    default
 mypool/home  zoned            off                    default
 mypool/home  snapdir          hidden                 default
 mypool/home  aclmode          groupmask              default
 mypool/home  aclinherit       restricted             default
 mypool/home  canmount         on                     default
 mypool/home  shareiscsi       off                    default
 mypool/home  xattr            on                     default
 mypool/home  copies           1                      default
 mypool/home  version          3                      -
 mypool/home  utf8only         off                    -
 mypool/home  normalization    none                   -
 mypool/home  casesensitivity  sensitive              -
 mypool/home  sharesmb         off                    default

Here we have a variety of useful knobs to turn. The first several properties are informational, such as creation time, space used, available and referenced. Here is a short list of what some of these properties are and how to use them:

* quota: Space quotas can be imposed anywhere, and they apply to a data set and all its descendants. Simply "zfs set quota=10g mypool/home/user001" and that user can never use more than 10GB of disk.
* reservation: Similar to a quota, but reservations are "pre-allocated": the disk space is removed from common use so that it is guaranteed to the data set.
* mountpoint: The place where the data set is mounted. Mount points are created by ZFS on your behalf.
* compression: Just "zfs set compression=on mypool" and all newly written data will be compressed! Define this for everything or on a case-by-case basis.
* atime: If atime is "on," every time you touch a file its access time stamp is updated, which can mean a lot of unwanted write activity. Use "zfs set atime=off mypool" to disable it.
* readonly: Want to lock away archive data? Just "zfs set readonly=on mypool/home/user002" and the user can look but not touch.
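To confirm what you've set, "zfs get" also accepts a comma-separated list of properties, and its SOURCE column tells you whether each value was set locally, inherited, or left at the default. A sketch using the data sets created earlier:

```shell
zfs set quota=10g mypool/home/user001         # set a local quota on one user
zfs set compression=on mypool/home            # set compression on the parent
zfs get quota,compression mypool/home/user001
# SOURCE will show "local" for the quota and
# "inherited from mypool/home" for compression
```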

Three really exciting properties above are "sharenfs", "shareiscsi", and "sharesmb". By simply turning the property on ("zfs set sharenfs=on mypool/home") you've exported that data set, and any children, via NFS. No fuss, no muck -- turn it on and you're done. The same applies to iSCSI and CIFS ("smb").

We can also create block volumes just as easily as file systems. With these volumes we can create legacy file systems (UFS, VxFS, etc.) or share iSCSI block volumes.

 $ zfs create mypool/volumes
 $ zfs create -V 500m mypool/volumes/volume001
 $ zfs list
 NAME                      USED  AVAIL  REFER  MOUNTPOINT
 mypool                    500M  2.23G    20K  /mypool
 mypool/home                76K  2.23G    22K  /myhome
 mypool/home/user001        18K  2.23G    18K  /myhome/user001
 mypool/home/user002        18K  2.23G    18K  /myhome/user002
 mypool/home/user003        18K  2.23G    18K  /myhome/user003
 mypool/volumes            500M  2.23G    18K  /mypool/volumes
 mypool/volumes/volume001  500M  2.72G    16K  -

You can see that creating a block volume data set is done in the same way as a file system data set. We simply add "-V" followed by the desired size.
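On Solaris, each volume surfaces as a device node under /dev/zvol, so putting a legacy file system on one is business as usual. A sketch (the mount point is hypothetical):

```shell
newfs /dev/zvol/rdsk/mypool/volumes/volume001             # build a UFS file system on the raw device
mkdir /mnt/legacy                                         # hypothetical mount point
mount /dev/zvol/dsk/mypool/volumes/volume001 /mnt/legacy  # mount the block device
```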

If we'd added the "-s" flag after "create," we would have created a "sparse" volume, better known as thin provisioning. Thin provisioning means that we've defined a block allocation but the blocks aren't actually taken away until they are requested. In this way, we can create dozens of block volumes even if we don't have enough space for them right now. Because resizing file systems can be complex, this lets us size for the future even if the disk isn't actually available at the moment.

Here is a grotesque example on my little pool with only 2.2 GB available:

 $ zfs create -s -V 1t mypool/volumes/megavol
 $ zfs list
 NAME                      USED  AVAIL  REFER  MOUNTPOINT
 mypool                    500M  2.23G    20K  /mypool
 ...
 mypool/volumes            500M  2.23G    18K  /mypool/volumes
 mypool/volumes/megavol     16K  2.23G    16K  -
 mypool/volumes/volume001  500M  2.72G    16K  -

Notice that I created a 1 TB volume, but it's only consuming 16K.

Snapshots and cloning
ZFS makes things easy to create and manage as we've seen, but it also brings enterprise-grade features down to the average user. The best example is that of snapshots and cloning.

We can create a snapshot using the "zfs snapshot" command, following the data set name with an "@" and the desired snapshot name.

 $ zfs snapshot mypool/home/user001@snap01
 $ zfs list
 NAME                        USED  AVAIL  REFER  MOUNTPOINT
 mypool                      500M  2.23G    20K  /mypool
 mypool/home                  76K  2.23G    22K  /myhome
 mypool/home/user001          18K  2.23G    18K  /myhome/user001
 mypool/home/user001@snap01     0      -    18K  -

The "@" character separates the data set name from the snapshot name. Snapshots are lightweight and created instantaneously.

One of the advantages of snapshots is the ability to cherry-pick files out of them.

In the "/myhome/user001" mount point we'll find a hidden ".zfs" directory that doesn't show up in normal listings but gives us access to the snapshot contents:

 $ cd /myhome/user001
 $ ls -alh
 total 3.0K
 drwxr-xr-x 2 root root 2 Dec 31 14:01 .
 drwxr-xr-x 5 root root 5 Dec 31 14:01 ..
 $ cd .zfs
 $ ls -l
 total 0
 dr-xr-xr-x 2 root root 2 Dec 31 14:01 snapshot
 $ cd snapshot/
 $ ls -l
 total 2
 drwxr-xr-x 2 root root 2 Dec 31 14:01 snap01

Here we can traverse the file system as it appeared at the snapshot's point in time and recover files by simply copying them out.
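Recovering a file is an ordinary copy, and if the entire data set should go back in time, "zfs rollback" reverts it in one shot. A sketch using the "snap01" snapshot above (the file name is hypothetical):

```shell
cp /myhome/user001/.zfs/snapshot/snap01/report.txt /myhome/user001/  # restore one file (hypothetical name)
zfs rollback mypool/home/user001@snap01                              # revert the whole data set to the snapshot
```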

Snapshots are used for many things, but let's look at cloning. For example, if you are working on a project and want to create a copy of it for another user so he doesn't mess with your work, no problem! Create a snapshot and clone it!

 $ zfs create mypool/project
 $ zfs create mypool/project/working
 $ zfs snapshot mypool/project/working@snap01
 $ zfs clone mypool/project/working@snap01 mypool/project/working-copy
 $ zfs list
 NAME                           USED  AVAIL  REFER  MOUNTPOINT
 mypool                         500M  2.23G    21K  /mypool
 mypool/project                  37K  2.23G    19K  /mypool/project
 mypool/project/working          18K  2.23G    18K  /mypool/project/working
 mypool/project/working@snap01     0      -    18K  -
 mypool/project/working-copy       0  2.23G    18K  /mypool/project/working-copy
 ...

You can take that clone and NFS or CIFS share it, or do whatever you like!
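If the clone eventually becomes the "real" copy, "zfs promote" reverses the origin dependency so the original data set can be retired. A sketch continuing the example above:

```shell
zfs promote mypool/project/working-copy   # make the clone independent; the origin snapshot moves to it
zfs destroy mypool/project/working        # the original can now be removed (destructive!)
```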

ZFS brings enterprise storage capabilities to any system of any size. All the examples I used were performed using three 1 GB USB sticks. Administration is simple, easy to understand and extremely fast. I hope this article has given you that warm fuzzy feeling that will help you get started using this amazingly powerful open source technology in your environment.
