Cloud storage poses awkward questions for file storage

The advent of cloud storage, with its demands for huge and dynamic storage schemes, has raised questions over the fundamentals of how we store files.

There's an ongoing cloud storage debate about whether data stored in the cloud should be stored and accessed as a file or as an object. Protagonists on both sides talk about the need to store billions and perhaps even trillions of pieces of data in the cloud. The storage infrastructure has to be able to scale to such levels and cope with data moving from one physical storage container to another.

Those who support using object-based storage say that a file is both a name and a location, and that essentially identical files stored under different names waste storage on duplicates. In addition, navigating the hierarchical structure of a file system tree is slow compared to a simple object address lookup.

Should you design an object storage system from a blank slate or can you layer it onto an existing file system?
Let's start with a single data element, a Word document, and treat it first as a file and then as an object to see how this works. The sample file name is C:\Workfile\purchases\order1234.doc. If we hash the contents, we arrive at an object name, which is D34gb78k8923L67.
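The idea can be sketched in a few lines of Python. The document bytes and the SHA-256 choice here are illustrative assumptions; the text's D34gb78k8923L67 is a stand-in, whereas a real content hash would be a hex digest like the one produced below.

```python
import hashlib

# Content-address a document: hash its bytes to get an object name that is
# independent of any file path. (The bytes below stand in for the contents
# of order1234.doc; the algorithm choice is an assumption.)
document = b"example Word document contents"
object_name = hashlib.sha256(document).hexdigest()

# Identical contents under a different filename yield the same object name,
# so duplicates can be detected and stored once.
duplicate_name = hashlib.sha256(document).hexdigest()
assert duplicate_name == object_name
```

Note that the name depends only on the bytes, which is exactly why two copies of the same document can't hide behind two different filenames.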

The object name is unique to that document and is the only metadata associated with the object. The object store maps the object's physical storage address to the object name and to any separately stored metadata for it. If the object is moved, its physical address in the lookup table is altered and access remains the same.
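A minimal sketch of that lookup table, with invented addresses, shows why a move is invisible to clients: only the table entry changes, never the name they resolve.

```python
# Hypothetical object store address table: object name -> physical location.
store = {}

def put(name, address):
    store[name] = address

def move(name, new_address):
    # Relocating the object only rewrites the table entry;
    # clients keep resolving the same name.
    store[name] = new_address

put("d34gb78k8923l67", ("array-1", "disk-7", 0x2F00))
move("d34gb78k8923l67", ("array-2", "disk-3", 0x0A10))

assert store["d34gb78k8923l67"] == ("array-2", "disk-3", 0x0A10)
```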

Now let's turn to the file. For starters, its full name is also its address. We know that it's a Word document because the .doc extension tells us so. We know that it's on the current computer's C: disk drive, two folders down the folder tree from the start folder. The application that created it, Word, actually stores a lot of metadata in the document, such as name, author, creation date, last modified date, length and so on. This is invisible to the file system.

If the file is moved, and we later tell Word to find it, an error is returned. The file system isn't responsible for telling us when a file has been moved. These aspects of file system storage render it unsuitable for storing billions of files in the cloud, where they will inevitably move and where we need to be assured of their integrity and have the ability to move and reference sets of objects.
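The contrast is easy to demonstrate: because a file's name is its location, moving the file breaks every reference to the old path. The file names below are hypothetical.

```python
import os
import shutil
import tempfile

# Create a throwaway file, then move it, as might happen during a
# storage migration in the cloud.
root = tempfile.mkdtemp()
old_path = os.path.join(root, "order1234.doc")
new_path = os.path.join(root, "archive", "order1234.doc")
open(old_path, "wb").close()
os.makedirs(os.path.dirname(new_path))
shutil.move(old_path, new_path)

# An application still holding the old path gets an error -- the file
# system does not forward references to the new location.
try:
    open(old_path, "rb")
    moved_silently = True
except FileNotFoundError:
    moved_silently = False
```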

EMC has created a new object-based storage product called Atmos for its cloud storage customers. NetApp says that sort of root-and-branch, rip-and-replace of existing file-based storage systems isn't necessary. They say users can already remove locality from filenames and refer to groups of files -- it's called a folder.

In a blog post, Alex MacDonald, a competitive analysis team member at NetApp, uses the example of a blog post's URL, and says it "doesn't exist until you ask for it, because it's dynamically generated from parts. There's no directory shadeofblue, 2009, or 08, and no file poetry-corner.html either. And (here's a clue how far this goes) it doesn't live on a server at either. It's all name and no location, and it works across the entire Internet, not just inside a single object store."

The string he refers to is an Internet relative uniform resource name (URN), not a standard file name, and its components can be dynamically mapped to actual stored data without the person or application specifying the URN knowing where the components are actually located.
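A hedged sketch of that resolution step, reusing the path components from MacDonald's quote: the requester supplies only a name, and the resolver assembles the content at request time from wherever the parts happen to live.

```python
# Hypothetical resolver for a relative URN. No directory or file exists
# beforehand; the response is generated from the name's components.
def resolve(urn):
    category, year, month, page = urn.split("/")
    # Where the underlying parts are physically stored is the resolver's
    # concern, not the requester's -- the name carries no location.
    return f"<html><!-- {page} for {category}, {year}-{month} --></html>"

html = resolve("shadeofblue/2009/08/poetry-corner.html")
```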

With a relative URN approach, MacDonald said, you can have the following:

  • Transient objects with a limited lifespan or number of uses
  • Dynamic objects that are programmable
  • Fragmentable objects where an object can be split into distinct parts
  • Modifiable objects that change in response to external events (for example, stock market data)
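To make the first item on that list concrete, here is an illustrative sketch of a transient object whose name can only be resolved a limited number of times; the class and its fields are invented for the example.

```python
# Hypothetical transient object: resolvable only a fixed number of times,
# e.g. a one-time download link.
class TransientObject:
    def __init__(self, data, max_uses):
        self.data = data
        self.uses_left = max_uses

    def fetch(self):
        if self.uses_left <= 0:
            raise LookupError("object expired")
        self.uses_left -= 1
        return self.data

ticket = TransientObject(b"one-time download", max_uses=1)
first = ticket.fetch()   # succeeds; a second fetch() raises LookupError
```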

That's a long way from Word docs and what we normally think of as files. NetApp stores data on its arrays using its Write Anywhere File Layout (WAFL) concept. This is then used by NetApp's Data ONTAP software to present the data as either files or blocks. The company is also going to add object storage functionality to this.

For a business wanting to move into cloud storage, the upside to the NetApp approach is that you can use existing NetApp storage.

A startup cloud services provider, with no legacy data and infrastructure, can start with a blank slate (or clear sky) and design a storage infrastructure that perfectly suits its need to store billions of items and to have metadata stored with them in a way that isn't compromised by existing storage schemes. Nirvanix is an example of this.

A storage company wanting to build a cloud storage product and become a cloud storage supplier can do the same thing, and EMC is probably the best example of that.

The argument boils down to this: Should you design an object storage system from a blank slate or can you layer it onto an existing file system? Protagonists on the object side say that a ground-up object design can do better things with metadata.

Chuck Hollis, EMC's vice president and global marketing chief technology officer, had this to say in his blog: "... it's this metadata that's the foundation for all the cool things we all would like to do with information services around the object: serving it up at the right service level, placing it in the right location at the same time, figuring out how long to keep it or when we can delete it, enforcing security and compliance policies, how it relates to other objects and business processes, etc. etc. etc."

NetApp's MacDonald would have his object store on top of a file system. Indeed the files become objects. But which choice is correct in theory probably doesn't matter outside of a computer science lecture room, because it's what works best in practice that counts. For that, we will have to wait and see.

Chris Mellor is storage editor with The Register.
