Get to grips with multimedia storage

Multimedia storage requires the smartest use of technology and tagging.

If anyone knows about storing multimedia files, it is Joanna Kossuth. As the CIO of the Franklin W Olin College of Engineering in Massachusetts, she has to support the needs of students and teachers in a school that adopts a multidisciplinary approach to teaching.

"We teach engineering differently," she says, explaining that one of her teachers is a molecular biologist and an exhibiting artist. "Her students are all required to do videos for her class, so we end up with huge file storage requirements.

"A key thing is for us to store those in a reasonable manner without having to buy 5,000 servers to do it on. A second piece for us was to have fast enough throughput on the system to avoid tying up the network as a whole."

Things were easier back in the day when computers simply could not handle many different types of data. If it was not a relational database, a spreadsheet, or a text file, your luck was out. Now, thanks to voice over IP, instant messaging, photo and video applications, the need to store multimedia files is increasing, and IT departments are coming under pressure to deliver.

Coping with a mixture of large file sizes and low-latency requirements creates problems in areas such as indexing, provisioning and speed of operation. These create issues in storage and networking, which in a multimedia-storage system are inextricably linked.

For Olin's students, storing multimedia files may be an educational luxury, but for many, it is a legal necessity. Here, compliance is an issue. "The general rule is that any organisation should have a document and data-retention policy, but there is no universal policy that suits every business," says Struan Robertson, partner at IT law firm Pinsent Masons.

He says businesses have to examine the different reasons for holding on to specific types of data. "These include providing evidence for disputes, because in this country, you would want to keep any documents relating to a contract for six years in case someone brought a lawsuit in that period."

Are you scanning paper insurance claims into a document management system? You would be well advised to keep that data if it could be used as evidence later. Digitising X-ray photographs? Patients may want to access them years later. And financial institutions that record calls may well need access to those.

Even instant messaging traffic can be considered to have unique storage requirements, says Michael Kilian, chief technology officer of the Centera product arm of storage supplier EMC.

"You have small quantities of data, but you have orders of magnitude more files," he says. "Financial institutions using instant messaging for customer-broker interactions may have to keep those for compliance purposes. It is not capacity that is the problem there - it is the number of discrete items that you have to keep track of."

Whether you are dealing with large video files or small instant messaging files, indexing and metadata become a key part of a storage system, says Jon Collins, service director at analyst firm Freeform Dynamics. "It is all down to how good the index is," he says.

Trying to search by querying a set of multimedia files could slow your system to a crawl. "You can perform indexing - a sort of pre-search through those files. You could conduct some voice recognition, and then store the information somewhere, and that will then give you a reference into your data."

This is particularly useful if you want to archive your multimedia files and rarely, if ever, use them, Collins says. "You can search the index for what you need, and if it transpires, for example, that a certain voice conversation is interesting, you can go and look at it then."
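The "pre-search" Collins describes can be illustrated with a small inverted index. The sketch below is only an illustration under stated assumptions: the transcripts and file names are hypothetical, standing in for whatever a voice-recognition pass might produce, and the functions `build_index` and `search` are invented here rather than taken from any product mentioned in this article.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each lower-cased word to the set of file names whose transcript contains it."""
    index = defaultdict(set)
    for file_name, text in transcripts.items():
        for word in text.lower().split():
            # Strip common trailing punctuation so "fund." and "fund" match.
            index[word.strip(".,?!")].add(file_name)
    return index


def search(index, query):
    """Return the files whose transcripts contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical transcripts, standing in for the output of a speech-recognition pass.
transcripts = {
    "call_001.wav": "Please sell the bond fund today.",
    "call_002.wav": "I would like to buy the bond fund.",
    "call_003.wav": "Can you confirm my address?",
}
index = build_index(transcripts)
matches = search(index, "bond fund")  # only the index is consulted, not the audio
```

The point of the design is that a query touches the small index alone; the large audio files are opened only once a match looks interesting.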

The growing role of metadata

Metadata can take various forms, including basic information about a file such as names of participants. More advanced indexing technologies that recognise content in video files are also available.

For example, Autonomy's Virage system processes video and audio to make files searchable, says Autonomy chief executive, Mike Lynch. "It is the index that gets hit a lot, so you must be careful that the link to the index is not slow," he says. In Virage's case, the index is a "probability lattice" - a collection of data that provides a best guess as to the contents, and makes it searchable.

For Kossuth, creating metadata is as much a user activity as an IT task. "IT is not the best at defining what that metadata is," she says. "Students and faculty have a lot more time than we do in IT to think about how they want to describe their data."

This is similar to the discussions of "folksonomies" that are found in Web 2.0 applications such as YouTube and Flickr. Users tag their own files - and sometimes one another's - to create a bottom-up set of descriptions that sits in stark contrast to the traditional top-down taxonomies that knowledge management companies and archivists have often imposed upon users.

One piece of metadata that Centera thinks is particularly important is temporal data. In white papers, the firm places particular emphasis on the last modification date. It is easy to see the relevance of this data when storing large numbers of files that you want to try to archive wherever possible.
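As a rough illustration of how a last-modification date might drive archiving decisions, the following sketch walks a directory tree and picks out files that have not changed for a given number of days. The function name `archive_candidates` and the idle threshold are assumptions made for this example; it does not represent Centera's approach.

```python
import os
import time


def archive_candidates(root, max_idle_days, now=None):
    """Yield paths under root whose last-modification time is older than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86,400 seconds in a day
    for dir_path, _dir_names, file_names in os.walk(root):
        for name in file_names:
            path = os.path.join(dir_path, name)
            if os.path.getmtime(path) < cutoff:
                yield path
```

Calling `list(archive_candidates("/some/share", 365))` would, under these assumptions, list files untouched for a year; the path shown is a placeholder.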

Letting the index bear the brunt of the data queries in a multimedia environment could fit nicely with virtual tape libraries, which store infrequently accessed data on drives that are spun down when not in use. The data remains available quickly - certainly more quickly than if it were archived to tape.

"Disc spin-down is a big issue," says Lynch, who also provides a virtual archiving service for users who need to store data such as voice calls. "The content only gets hit very occasionally when someone wants to read something, so if you have discs that can spin down, then you save vast amounts of power."

Scalability challenge

Kilian says metadata in general will become very important, and proposes content-addressable storage as a means of dealing with file systems that must scale to large collections of individual objects.

The idea of addressing a file as an opaque object via a uniform resource identifier rather than as a file with a physical location on a disc will become particularly important as the number of items grows, he says. "The first use is to address the challenge 'how do I store a billion voice messages without having to worry about what my file system looks like?'."
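One common way to realise content-addressable storage is to derive each object's address from a hash of its contents, so that callers never deal with directories or paths. The sketch below is a toy, in-memory version using SHA-256; the class `ContentStore` is hypothetical and is not the interface of Centera or any other product.

```python
import hashlib


class ContentStore:
    """Toy in-memory object store keyed by the SHA-256 digest of each object's bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store data and return its opaque address (a hex digest)."""
        address = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so it is stored only once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        """Fetch an object by address; raises KeyError if it is unknown."""
        return self._objects[address]

    def __len__(self):
        return len(self._objects)


store = ContentStore()
first = store.put(b"voice message 1")
duplicate = store.put(b"voice message 1")
second = store.put(b"voice message 2")
```

Because the address depends only on the bytes, the caller holds an opaque identifier and the store is free to place the object wherever it likes.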

Divorcing the physical location of a multimedia object from its location in the file system could become a critical component of storage virtualisation, which is becoming crucial to managing large, volatile multimedia storage.

Kossuth could not have put together her IP-based storage system without a virtualisation layer. "For us, having direct attached storage was a nightmare," she says. Her multimedia storage system consists of two storage arrays represented as a single volume.

Even with the indexing and other strategies, one challenge for Kossuth is getting the data off the disc fast enough. In applications where files have to be streamed with low latency, disc speed is crucial.

"We needed drives that were fast enough so that when it came to backing up to another disc and retrieving the images and the information, it would not take a day and a half to do it," says Kossuth. She started off with 7,200rpm disc drives, and has since moved to 10,000rpm units.

Managing simultaneous access

Placing multimedia content on traditional storage media is a problem, because drives and their file systems were not designed to manage simultaneous access to large files. One or two users might be fine, but IT departments may run into problems when scaling such systems across large numbers of users.

Using indexes can help, but there are other strategies to speed up disc access in heavy-usage environments. Isilon Systems uses what director of product management Sam Grocott calls "clustered storage" to help solve the problem.

The company layers its own software on top of industry-standard hardware, physically striping large files such as video across multiple storage devices. The system enables faster retrieval even with standard 7,200rpm SATA drives, because it pulls portions of the file from the array simultaneously. "Each node has 4Gbytes of cache, so a 10-node cluster has a 40Gbyte cache," Grocott says. "We can deliver 15,000rpm-like speeds."
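The general idea of striping can be sketched as dealing fixed-size chunks of a file round-robin across a set of nodes. This is a simplified illustration, not Isilon's implementation: the functions `stripe` and `reassemble` and the chunk size are assumptions, the nodes are plain Python lists, and the chunks are read back sequentially here, whereas a real cluster would fetch them in parallel.

```python
def stripe(data: bytes, node_count: int, stripe_size: int):
    """Split data into stripe_size chunks and deal them round-robin across node_count nodes."""
    nodes = [[] for _ in range(node_count)]
    for offset in range(0, len(data), stripe_size):
        chunk_number = offset // stripe_size
        nodes[chunk_number % node_count].append(data[offset:offset + stripe_size])
    return nodes


def reassemble(nodes):
    """Rebuild the original bytes by taking one chunk from each node in turn."""
    chunks = []
    longest = max((len(node) for node in nodes), default=0)
    for row in range(longest):
        for node in nodes:
            if row < len(node):
                chunks.append(node[row])
    return b"".join(chunks)


# A stand-in for a video file: 1,024 bytes split into 100-byte stripes over 3 nodes.
video = bytes(range(256)) * 4
nodes = stripe(video, node_count=3, stripe_size=100)
restored = reassemble(nodes)
```

Since each node holds only every third chunk in this example, three drives can each serve part of the same file at once, which is where the speed-up comes from.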

One advantage of investing in the technology to support multimedia storage is that it could then be used to garner other benefits. For example, Kossuth is considering cross-campus storage virtualisation.

Students are already cross-registered between a couple of the schools, but she would like to enable students to access their storage systems by logging in at a single location. "Basically, we did virtualisation ourselves, and now, we are asking how we virtualise and extend to communities that are not within our group."

Solving the unique problems associated with storing multimedia can therefore fortify your storage system against the more general challenges to come. One thing is certain: as time goes on, data volumes are unlikely to decrease.
