If anyone knows about storing multimedia files, it is Joanna
Kossuth. As the CIO of the Franklin W Olin College of
Engineering in Massachusetts, she has to
support the needs of students and teachers in a school that adopts
a multidisciplinary approach to teaching.
"We teach engineering differently," she says, explaining that
one of her teachers is a molecular biologist and an exhibiting
artist. "Her students are all required to do videos for her class,
so we end up with huge file storage requirements.
"A key thing is for us to store those in a reasonable manner
without having to buy 5,000 servers to do it on. A second piece for
us was to have fast enough throughput on the system to avoid tying
up the network as a whole."
Things were easier back in the day when computers simply could
not handle many different types of data. If it was not a relational
database, a spreadsheet, or a text file, your luck was out. Now,
thanks to voice over IP, instant messaging, photo and video
applications, the need to store multimedia files is increasing, and
IT departments are coming under pressure to deliver.
Coping with a mixture of large file sizes and low-latency
requirements creates problems in areas such as indexing,
provisioning and speed of operation. These create issues in storage
and networking, which in a multimedia-storage system are
inextricably linked.
For Olin's students, storing multimedia files may be an
educational luxury, but for many, it is a legal necessity. Here,
compliance is an issue. "The general rule is that any organisation
should have a document and data-retention policy, but there is no
universal policy that suits every business," says Struan Robertson,
partner at IT law firm Pinsent Masons.
He says businesses have to examine the different reasons for
holding on to specific types of data. "These include providing
evidence for disputes, because in this country, you would want to
keep any documents relating to a contract for six years in case
someone brought a lawsuit in that period."
Are you scanning paper insurance claims into a document
management system? You would be well advised to keep that data if
it could be used as evidence later. Digitising X-ray photographs?
Patients may want to access them years later. And financial
institutions that record calls may well need access to those.
Even instant messaging traffic can be considered to have unique
storage requirements, says Michael Kilian, chief technology officer
of the Centera product arm of storage supplier EMC.
"You have small quantities of data, but you have orders of
magnitude more files," he says. "Financial institutions using
instant messaging for customer broker interactions may have to keep
those for compliance purposes. It is not capacity that is the
problem there - it is the number of discrete items that you have to
keep track of."
Whether you are dealing with large video files or small instant
messaging files, indexing and metadata become a key part of a
storage system, says Jon Collins, service director at analyst firm
Freeform Dynamics. "It is all down to how good the index is,"
he says.
Trying to search by querying a set of multimedia files could
slow your system to a crawl. "You can perform indexing - a sort of
pre-search through those files. You could conduct some voice
recognition, and then store the information somewhere, and that
will then give you a reference into your data."
This is particularly useful if you want to archive your
multimedia files and rarely, if ever, use them, Collins says. "You
can search the index for what you need, and if it transpires, for
example, that a certain voice conversation is interesting, you can
go and look at it then."
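The pre-search Collins describes can be sketched as a simple inverted index. This is a minimal illustration, not any supplier's implementation: the transcripts stand in for voice-recognition output, and the file names and the `build_index` and `search` helpers are hypothetical.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each lowercase word to the set of files whose transcript contains it."""
    index = defaultdict(set)
    for file_name, text in transcripts.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(file_name)
    return index


def search(index, word):
    """Return a sorted list of files whose transcript mentions the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical transcripts, standing in for voice-recognition output.
transcripts = {
    "call_001.wav": "Please sell the bonds today.",
    "call_002.wav": "I would like to open an account.",
    "call_003.wav": "Sell half of the shares, keep the bonds.",
}

index = build_index(transcripts)
print(search(index, "bonds"))  # prints ['call_001.wav', 'call_003.wav']
```

The query touches only the index, so the large audio files can stay on slow or archived storage until one of them is actually needed.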
The growing role of metadata
Metadata can take various forms, including basic information
about a file such as names of participants. However, advanced
indexing technologies that recognise content in video files are
available.
For example, Autonomy's Virage system processes video and audio
to make files searchable, says Autonomy chief executive, Mike
Lynch. "It is the index that gets hit a lot, so you must be careful
that the link to the index is not slow," he says. In Virage's case,
the index is a "probability lattice" - a collection of data that
provides a best guess as to the contents, and makes it
searchable.
For Kossuth, creating metadata is as much a user activity as an
IT task. "IT is not the best at defining what that metadata is,"
she says. "Students and faculty have a lot more time than we do in
IT to think about how they want to describe their data."
This is similar to the discussions of "folksonomies" that are
found in Web 2.0 applications such as YouTube and Flickr. Users tag
their own files - and sometimes one another's
- to create a bottom-up set of descriptions that sits in stark
contrast to the traditional top-down taxonomies that knowledge
management companies and archivists have often imposed upon
users.
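A bottom-up tagging scheme of this kind can be sketched in a few lines. The `TagStore` class, the user names and the file name below are hypothetical, and the sketch assumes that each user's tag counts once per file, which is a design choice rather than anything the sites above document.

```python
from collections import Counter, defaultdict


class TagStore:
    """Toy folksonomy: users attach free-form labels to files."""

    def __init__(self):
        # file name -> set of (user, label) pairs, so repeat tags count once
        self._tags = defaultdict(set)

    def tag(self, file_name, user, label):
        self._tags[file_name].add((user, label.strip().lower()))

    def top_tags(self, file_name, n=3):
        """Return up to n labels, most popular first, ties broken alphabetically."""
        pairs = self._tags.get(file_name, set())
        counts = Counter(label for _, label in pairs)
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [label for label, _ in ranked[:n]]


store = TagStore()
store.tag("bridge.mov", "ana", "Robotics")
store.tag("bridge.mov", "ben", "robotics")
store.tag("bridge.mov", "ben", "robotics")  # duplicate from the same user
store.tag("bridge.mov", "cy", "bridge")
print(store.top_tags("bridge.mov"))  # prints ['robotics', 'bridge']
```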
One piece of metadata that Centera thinks is particularly
important is temporal data. In white papers, the firm places
particular emphasis on the last modification date. It is easy to
see the relevance of this data when storing large numbers of files
that you want to try to archive wherever possible.
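As an illustration of how the last modification date might drive archiving, the sketch below lists files that have not changed for a given number of days. The `archive_candidates` function and the idle threshold are assumptions made for this example, not Centera's method.

```python
import os
import time


def archive_candidates(directory, max_idle_days, now=None):
    """Return sorted names of regular files not modified in the last max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 24 * 60 * 60
    candidates = []
    for entry in os.scandir(directory):
        # st_mtime is the last modification time in seconds since the epoch
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            candidates.append(entry.name)
    return sorted(candidates)
```

Calling `archive_candidates("/path/to/media", 365)` on a hypothetical media directory would list the files untouched for a year or more, which are the natural ones to move to cheaper storage.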
Using indexing to bear the brunt of the data queries in a
multimedia environment could fit nicely with virtual tape libraries
designed to store infrequently accessed data on drives that are
spun down when not in use, making them available quickly -
certainly more quickly than if they were archived to tape.
"Disc spin-down is a big issue," says Lynch, who also provides a
virtual archiving service for users who need to store data such as
voice calls. "The content only gets hit very occasionally when
someone wants to read something, so if you have discs that can spin
down, then you save vast amounts of power."
Scalability challenge
Kilian says metadata in general will become very important, and
proposes content-addressable storage as a means of dealing with
file systems that must scale to large collections of individual
objects.
The idea of addressing a file as an opaque object via a uniform
resource identifier rather than as a file with a physical location
on a disc will become particularly important as the number of items
grows, he says. "The first use is to address the challenge 'how do
I store a billion voice messages without having to worry about what
my file system looks like?'."
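A toy version of the idea is sketched below. It assumes the opaque identifier is a SHA-256 digest of the object's bytes, which is a common way to build content-addressable storage but is not confirmed here as the scheme Centera uses. The `ContentStore` class is hypothetical and keeps everything in memory.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store: the address is derived from the bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data  # identical content maps to the same address
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self):
        return len(self._objects)


store = ContentStore()
addr1 = store.put(b"voice message 1")
addr2 = store.put(b"voice message 1")  # duplicate content, same address
addr3 = store.put(b"voice message 2")
print(addr1 == addr2, len(store))  # prints True 2
```

The caller holds only the address and never a path, so the store is free to place, move or replicate the object without the application noticing.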
Divorcing the physical location of a multimedia object from its
location in the file system could become a critical component of
storage virtualisation, which is becoming crucial to managing
large, volatile multimedia storage.
Kossuth could not have put together her IP-based storage system
without a virtualisation layer. "For us, having direct attached
storage was a nightmare," she says. Her multimedia storage system
consists of two storage arrays represented as a single volume.
Even with the indexing and other strategies, one challenge for
Kossuth is getting the data off the disc fast enough. In
applications where files have to be streamed with low latency, disc
speed is crucial.
"We needed drives that were fast enough so that when it came
to backing up to another disc and retrieving the images and the
information, it would not take a day and a half to do it," says
Kossuth. She started off with 7,200rpm disc drives, and has since
moved to 10,000rpm units.
Managing simultaneous access
Placing multimedia content on traditional storage media is a
problem, because drives and their file systems were not designed to
manage simultaneous access to large files. One or two users might
be fine, but IT departments may run into problems when scaling such
systems across large numbers of users.
Using indexes can help, but there are other strategies to speed
up disc access in heavy-usage environments. Isilon Systems uses
what director of product management Sam Grocott calls "clustered
storage" to help solve the problem.
The company lays its own software on top of industry-standard
hardware, physically striping large files such as video across
multiple storage devices. The system enables faster retrieval even
with standard 7,200rpm Sata drives, because it pulls portions of
the file from the array simultaneously. "Each node has 4Gbytes of
cache, so a 10-node cluster has a 40Gbyte cache," Grocott says. "We
can deliver 15,000rpm-like speeds."
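The striping idea, and the cache arithmetic Grocott quotes, can be illustrated with a short sketch. This is not Isilon's software: the round-robin layout, the chunk size and the function names are assumptions for the example, and real systems also add redundancy, which is omitted here.

```python
def stripe(data: bytes, node_count: int, chunk_size: int):
    """Distribute fixed-size chunks of data round-robin across node_count nodes."""
    nodes = [[] for _ in range(node_count)]
    for offset in range(0, len(data), chunk_size):
        chunk_number = offset // chunk_size
        nodes[chunk_number % node_count].append(data[offset:offset + chunk_size])
    return nodes


def reassemble(nodes):
    """Rebuild the original bytes by reading chunks back in round-robin order."""
    chunks = []
    depth = max(len(node) for node in nodes)
    for row in range(depth):
        for node in nodes:
            if row < len(node):
                chunks.append(node[row])
    return b"".join(chunks)


def aggregate_cache_gb(node_count: int, cache_per_node_gb: int) -> int:
    """Total cache across the cluster, as in the 10 x 4Gbyte example."""
    return node_count * cache_per_node_gb


video = bytes(range(256)) * 4          # 1,024 bytes standing in for a video file
nodes = stripe(video, node_count=10, chunk_size=64)
print(reassemble(nodes) == video)      # prints True
print(aggregate_cache_gb(10, 4))       # prints 40
```

Because consecutive chunks sit on different nodes, a reader can request them from several drives at once instead of waiting on a single spindle.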
One advantage of investing in the technology to support
multimedia storage is that it could then be used to garner other
benefits. For example, Kossuth is considering cross-campus storage
virtualisation.
Students are already cross-registered between a couple of the
schools, but she would like to enable students to access their
storage systems by logging in at a single location. "Basically, we
did virtualisation ourselves, and now, we are asking how we
virtualise and extend to communities that are not within our
group."
Solving the unique problems associated with storing multimedia
can therefore fortify your storage system against the more general
challenges to come. One thing is certain: as time goes on, data
volumes are unlikely to decrease.