Managing data with an object storage system
With the rapid growth of unstructured data in corporate systems, technologies that can effectively store large amounts of discrete files have begun to emerge. Scale-out and internal cloud systems are chief among them, but often under the hood of these approaches is an entirely different way of handling data: object-based storage.
To date, the file system has been king. It provides some form of central database that is consulted whenever an application requests data. The database typically holds information about each file, including directory (tree) information, the physical (hard drive) location of its blocks, and security and access restrictions.
It’s a hierarchical, location-based method of storing information about files. As data stores grow, traditional file systems soon become very complex. And they’re limited in the number of files they can deal with.
One response to ballooning numbers of files is to partition file systems into volumes. This helps resolve latency issues but adds management overhead, because related data ends up split across separate volumes.
Additionally, as file systems get bigger, the usual methods of protecting data against hardware failure—such as RAID—become too cumbersome. Not only do RAID rebuilds take days, RAID also imposes rigid disk organisational structures.
It is also expensive to power huge disk arrays, and RAID can offer no protection against bit errors on disk, which become statistically more likely as capacities increase. All this suggests that a new way of storing and managing files is called for.
Object-based storage scalability and robustness
Object-based storage dispenses with the hierarchical file system, so there are no nested subfolders. Instead, it uses an indexed flat-file structure in which each object is given a unique ID. Each object acts as a container for end users' files, and indexes are often held in RAM, making them very fast to access.
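The flat, indexed layout described above can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the class name FlatObjectStore, the use of random UUIDs as object IDs and the single in-memory byte buffer standing in for the disks are all assumptions made for illustration.

```python
import uuid


class FlatObjectStore:
    """Toy flat-namespace object store: no directories, only object IDs."""

    def __init__(self):
        # In-memory index: object ID -> (offset, length) in the data area.
        self._index = {}
        # Stand-in for the disks: one append-only byte buffer (assumption).
        self._data = bytearray()

    def put(self, payload: bytes) -> str:
        """Store a payload and return its newly generated unique ID."""
        object_id = uuid.uuid4().hex
        self._index[object_id] = (len(self._data), len(payload))
        self._data.extend(payload)
        return object_id

    def get(self, object_id: str) -> bytes:
        """Look the ID up in the index, then read the bytes directly."""
        offset, length = self._index[object_id]
        return bytes(self._data[offset:offset + length])


store = FlatObjectStore()
photo_id = store.put(b"jpeg bytes")
plan_id = store.put(b"blueprint bytes")
```

Retrieval is a single dictionary lookup followed by a direct read, with no directory tree to walk, which is the property that keeps lookups fast as the number of objects grows.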
Additionally, object stores tend to include more metadata, which then enables storage managers to contextualise a file and so manage and back it up more efficiently. For example, a JPEG file might be a picture of an employee's new baby, or it might be a product blueprint. Outside its original context—usually its directory—there is no way of understanding the file's content without richer metadata.
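As a rough illustration of how richer metadata supports management decisions, the following Python sketch selects objects for backup by their metadata rather than by their location. The StoredObject type, the select_for_backup helper and the metadata keys ("format", "category") are hypothetical and not taken from any product.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    object_id: str
    payload: bytes
    # Free-form descriptive metadata stored alongside the payload.
    metadata: dict = field(default_factory=dict)


def select_for_backup(objects, **required):
    """Return the objects whose metadata matches every required key/value."""
    return [
        obj for obj in objects
        if all(obj.metadata.get(key) == value for key, value in required.items())
    ]


objects = [
    StoredObject("a1", b"...", {"format": "jpeg", "category": "personal"}),
    StoredObject("b2", b"...", {"format": "jpeg", "category": "product-blueprint"}),
]

# Both objects are JPEGs; only the metadata tells them apart.
blueprints = select_for_backup(objects, category="product-blueprint")
```

Here the two JPEGs are indistinguishable by file type, yet a backup policy can still pick out the blueprint because the context travels with the object.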
There are strong similarities between the major vendors' various implementations of object-based storage, although they do differ in detail and positioning.
EMC offers its Centera Content-Addressed Storage System for object-based archiving. The system consists of a set of Linux-based networked servers divided into storage and access nodes. The company also promotes its cloud storage platform, Atmos, for production data.
Dell's DX6000 Object Storage system uses Caringo's CAStor software to manage the hardware. It uses a memory-based index, which, Caringo says, means a file's address can be found within 20 microseconds. According to the company, 4 GB of RAM can contain an index of 400 million objects, and the system scales without performance penalty. Dell does not distinguish between the archive and production storage markets and is effectively in competition with both of EMC's products.
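Caringo's figures imply a very compact index entry. As a rough check, and assuming that the 4 GB and 400 million figures are exact and that all of the RAM is used for index entries (the source confirms neither), the arithmetic works out as follows:

```python
objects = 400 * 10**6                 # 400 million objects, as quoted

# Reading "4 GB" as decimal gigabytes (assumption).
ram_bytes_decimal = 4 * 10**9
bytes_per_entry_decimal = ram_bytes_decimal / objects   # 10.0

# Reading "4 GB" as binary gibibytes (assumption).
ram_bytes_binary = 4 * 2**30
bytes_per_entry_binary = ram_bytes_binary / objects     # about 10.74
```

Either reading leaves roughly 10 to 11 bytes of RAM per object.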
NetApp's StorageGRID object-based product uses the company's Ontap GX storage operating system. Its file and metadata repositories can accommodate billions of files and petabytes of capacity. You can access the system via HTTP or using CIFS or NFS, and the system can be federated across multiple sites.
So, there's a range of object-based storage products designed to fix the problems created by ever-growing numbers of files. Now let’s take a look at how object-based storage products are being implemented in Europe.
University of Utrecht uses object storage for proteomics research
The University of Utrecht's Biomolecular Mass Spectrometry and Proteomics Group, in the Netherlands, generates more than 5 GB of data per day and needed a way to safeguard the 64 TB of data produced by its mass spectrometers. The group also needed to cut the cost of expansion and the time academics spent running tape backups and managing the system.
Assistant Professor Bas van Breukelen, bioinformatics group leader, said the previously installed EMC Celerra NS502 array was expensive to run and expand, as it needed an EMC technician to attend to it. "You also needed additional licences and servers, and the more disks you added, the more servers you needed," van Breukelen said.
"We started with 4 TB, then we needed 12 TB, and that meant we had to buy a complete new server," van Breukelen said. "Then the 12 TB ran out." The department looked at other products from EMC and HP, before considering Dell’s DX6000 Object Storage platform.
Van Breukelen said the Dell DX6000 appealed because the object store's built-in redundancy does away with the expense, management overhead and rebuild times of RAID.
"The EMC was a black box, but this is open," van Breukelen said. "It's just a bunch of disks that communicate with each other. You can address them individually, and you can use normal SATA disks. If you need more disks, you just add an enclosure with eight disks and throw it in there. And the servers are standard, so in future when servers get faster, you can just add a new one."
Tape backups are a thing of the past too, as the department has mirrored the object store to another site in the university. "We have four servers in the primary installation, four in the secondary mirror, and a fifth server that acts as a file server," van Breukelen said.
Van Breukelen now has 64 TB of data in each of the two mirrors under his charge. He said managing the system is easy using standard tools because the system automates data retention, retrieval and deletion.
For van Breukelen, the advantages of object-based storage are clear. It is easy to use, maintenance-free and highly reliable: even if a piece of hardware fails, the data remains intact. As an added bonus, it uses less power than the EMC equipment.
École Polytechnique Fédérale de Lausanne jazzes up its storage
École Polytechnique Fédérale de Lausanne (EPFL) is a federally funded university in Switzerland. It is close to Montreux, home of the famous jazz festival, whose organisers had a problem. They held 45 years of video and audio recordings in 24 different formats, amounting to 5,000 hours of concert time. The collection included recordings of performances as well as metadata, such as set lists and photos.
The tapes were ageing, so the festival's organisers asked EPFL to lead a digitisation project, the Montreux Sounds Digital Project, and to store the data for the indefinite future.
Alexandre Delidais, director of operations and development at EPFL, said that the initial plan was to store the digitised video on tape. But Delidais recognised that, over such long periods of time, the hardware and software needed to access the tapes would become obsolete, so tape was not a sustainable long-term option.
While transferring the digitised videos to tape for short- to medium-term archiving, EPFL explored options for more robust storage to accommodate the archive. The aim is to provide a live archive, accessible to the local community and EPFL students for research purposes, and a long-term archive that will amount to about 1.5 petabytes (PB).
Delidais met storage company Amplidata in 2010 and found the company's Optimized Object Storage (OOS) AS20 Storage Nodes offered the possibility of a live disk-based archive. OOS spreads objects over several storage nodes and can restore the original data in the event of multiple disk failures.
Amplidata's object storage technology, called BitSpread, is sold as an improvement on RAID: it distributes and stores data redundantly across several disks, minimising the risk that a single disk failure affects data. The system divides each data object into blocks, which it converts into a larger number of check blocks; it can then reconstruct the original data from a subset of them. Amplidata likens it to solving a Sudoku puzzle: once you fill in enough fields, you can calculate the rest.
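The general idea of turning data blocks into a larger number of check blocks can be sketched in a few lines of Python. This is not Amplidata's BitSpread algorithm, whose details the article does not give. It is a toy Reed-Solomon-style erasure code over the prime field of size 257, using one byte per block, and the function names are assumptions made for illustration.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def interpolate_at(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (needs Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Turn k data symbols into n > k check blocks, as (x, y) pairs."""
    base = list(enumerate(data))  # byte i is the polynomial's value at x = i
    return [(x, interpolate_at(base, x)) for x in range(n)]


def decode(surviving, k: int) -> bytes:
    """Rebuild the k original bytes from any k surviving check blocks."""
    points = surviving[:k]
    return bytes(interpolate_at(points, x) for x in range(k))


blocks = encode(b"jazz", n=7)   # 4 data symbols become 7 check blocks
survivors = [blocks[1], blocks[4], blocks[5], blocks[6]]  # 3 blocks lost
restored = decode(survivors, k=4)
```

In this sketch any four of the seven blocks are enough to rebuild the data, so if each block sat on a different disk, up to three disks could fail without loss. A production system would use larger blocks and a field in which every symbol fits in a byte.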
EPFL opted for object-based infrastructure because of its high levels of redundancy and reliability, as well as for its ability to store very large amounts of data that will be accessible using standards-based technology such as HTTP for the foreseeable future. And the college can expand the system as more data comes online. "We can manage the content for different usages, and it gives us high redundancy, security and access management," Delidais said. "We also keep a compressed version of the archive of about 200 TB on the Amplidata for use on a daily basis."
Delidais expanded on his storage criteria. "First, for our archive, we need high redundancy. OOS allows us to build redundancy into and optimise our storage. If one disk or server fails, we just replace it," he said.
"The second is security. We want to ensure the compressed version doesn't escape onto the Internet, and the Amplidata system allows us to segment the storage as we want and so control access. We foresee that there will be one segment for a secondary archive, another for online content in a low-res Web format and another for temporary usage," he said.
"The third requirement was speed. We need to be able to stream high-bit-rate video, which needs high throughput, peaking above 1 Gbps. Amplidata does this well compared to our existing RED5 NAS—also the RED5 is a disaster from a power perspective."