Data classification is "like eating an elephant," according
to Michael Masterson, IT manager at a Fortune 500 life sciences
company that's in the middle of a data classification project.
"Don't get discouraged," he said. "You can't do it all at
once."
Masterson's office has 60 Windows servers and a handful of Unix
machines, plus the latest EMC Corp. Clariion CX3 array for primary
storage and a Nexsan Technologies system for nearline, noncritical
data. He's using EMC Documentum for document management and has 9
terabytes (TB) of unstructured data floating around unmanaged.
About a year ago, Masterson's company decided it needed to
better understand the files it was storing before throwing any more
disk into its data center. Unfortunately, Masterson said, this
information isn't available in the metadata provided by Windows
systems.
"People have dumped stuff on me like I'm a landfill, but I'm not in
the storage business," he noted. He is, however, responsible for
ensuring that the company's scientists can find files months or
even years after they've created them -- and with a recovery time
of minutes or hours, not days. Drug discovery is a competitive
field, and it's heavily regulated, both by the Food and Drug
Administration (FDA) and under the Sarbanes-Oxley Act (SOX). "The risk of
not managing these files is huge," Masterson said.
Masterson uses what he calls a "folksonomic" approach to data
classification. Folksonomy is Internet parlance for tagging Web
content on the fly to make it easily discoverable to users of that
content. "People will not adapt consistently to one system ... it's
human nature to be constantly reorganizing," he said, "and files
are no different."
He's been piloting Abrevity Inc.'s FileData Classifier software
for approximately one year and is impressed with its ability to
work with legacy files and file systems, and to provide custom file
classification and tagging. "It uses tags [that] users have already
provided and words within the file system that they already
understand," he said.
Aside from email and the usual Microsoft Office files,
fluorescence-activated cell-sorting (FACS) files -- more commonly
called instrument files -- make up much of the company's
unstructured data. These are text files produced by flow
cytometers, instruments used to measure microscopic particles in
fluids. As the instruments become smarter, they crank out more
data, all of which must be stored and managed. Analysts report that
more dollars were spent last year for these types of instruments
than for IT storage systems, and an order of magnitude more files
were generated by them than by Microsoft Office or email users in
most of these life sciences facilities.
Masterson notes that while other data classification tools (he
looked at products offered by Arkivio Inc. and Kazeon Systems Inc.)
are designed to extract known values from a single document and
don't create indexes for multidocument searching, Abrevity's
FileData Classifier can search and parse FACS headers, extract
target data, tag files with new metadata for classification and
then allow for policy-based management.
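The parse-and-tag step Masterson describes can be sketched in a few lines of Python. This is an illustration only, not Abrevity's implementation: it assumes a simplified header made of delimiter-separated keyword/value pairs, loosely modeled on the TEXT segment of flow cytometry files, and the sample values and the `tag_file` helper are hypothetical.

```python
def parse_header(text):
    """Parse a delimiter-separated keyword/value header into a dict.

    Assumption: the first character is the delimiter and the remainder
    alternates keyword, value, keyword, value. Escaped delimiters inside
    values are not handled in this sketch.
    """
    if not text:
        return {}
    delimiter = text[0]
    fields = text[1:].split(delimiter)
    # Drop the empty field left behind by a closing delimiter.
    if fields and fields[-1] == "":
        fields.pop()
    # Pair keywords with values; an unpaired trailing keyword is ignored.
    return {fields[i].upper(): fields[i + 1] for i in range(0, len(fields) - 1, 2)}


def tag_file(header, wanted=("$DATE", "$CYT", "$TOT")):
    """Copy selected header keywords into a flat set of metadata tags."""
    return {key.lstrip("$").lower(): header[key] for key in wanted if key in header}


# Hypothetical header text for one instrument file.
sample = "/$DATE/12-MAR-2007/$CYT/ExampleCytometer/$TOT/10000/"
tags = tag_file(parse_header(sample))
print(tags)
```

Once tags like these sit in an index, a search across thousands of nested folders becomes a lookup on the tags rather than a matter of opening each file.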
"Engineers nest folders within folders, so it's important to be
able to search across these without having to open each file, which
can take hours or days," he said.
More significantly, FileData Classifier offers context-based
discovery rather than plain text searching. It's built on a
proprietary database technology the vendor calls SLICEbase rather
than on a relational database. This "speed[s] up searches
tremendously," Masterson claimed. "They've got the right approach
[to] preserving context."
Still, Masterson said that showing users how to tag files with a
business value is an arduous task. To that end, he built a survey
and created interview questions to find out which files are
important given business and regulatory requirements. The secret is
to keep classifications simple. "We have security and retention
tags only. Don't get too complex with it and create slices that
people will forget are even there," he advised. He also recommends
creating a short list of the most important data -- files for a
legal discovery case or human resource files, for example -- rather
than trying to tag everything.
So far, Masterson has indexed about one-third of his office's
unstructured content. His next step is to turn on policy automation
to force the back end to move files to the right location.
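As a rough sketch of what such policy automation might look like, the Python below maps the two tag types Masterson uses, security and retention, to a storage tier. The tag values ("restricted", "regulatory" and so on), the tier names and the rule itself are assumptions made for illustration; they don't describe his actual policy or the Abrevity product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTags:
    security: str   # hypothetical values: "restricted" or "general"
    retention: str  # hypothetical values: "regulatory", "project" or "none"


def target_tier(tags):
    """Pick a storage tier from a file's tags (assumed rule, for illustration)."""
    if tags.retention == "regulatory" or tags.security == "restricted":
        return "primary"
    return "nearline"


def plan_moves(catalog, current_tier):
    """List (path, from_tier, to_tier) for files that are on the wrong tier.

    catalog maps path -> FileTags; current_tier maps path -> tier name.
    Nothing is moved here; the result is only a plan.
    """
    moves = []
    for path, tags in sorted(catalog.items()):
        wanted = target_tier(tags)
        if current_tier.get(path) != wanted:
            moves.append((path, current_tier.get(path), wanted))
    return moves


# Hypothetical catalog of two tagged files.
catalog = {
    "assays/run42.fcs": FileTags(security="general", retention="regulatory"),
    "scratch/notes.doc": FileTags(security="general", retention="none"),
}
current = {"assays/run42.fcs": "nearline", "scratch/notes.doc": "nearline"}
print(plan_moves(catalog, current))
```

Keeping the rule to two tag types mirrors the advice to keep classifications simple: the fewer tags the policy depends on, the fewer users have to remember to apply.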
"It will [take] a while for us to achieve nirvana," said
Masterson. The dream is for users to tag files with the appropriate
values when they save them. Ideally, this functionality will be
built into the operating system, but for now the Abrevity tool is a
good start, he said.