Fortune 500 firm takes a crack at data classification

A life sciences company undertakes a data classification project with Abrevity and discovers the job is never done.

Data classification is "like eating an elephant," according to Michael Masterson, IT manager at a Fortune 500 life sciences company that's in the middle of a data classification project. "Don't get discouraged," he said. "You can't do it all at once."

Masterson's office has 60 Windows servers and a handful of Unix machines, plus the latest EMC Corp. Clariion CX3 array for primary storage and a Nexsan Technologies system for nearline, noncritical data. He's using EMC Documentum for document management and has 9 terabytes (TB) of unstructured data floating around unmanaged.

About a year ago, Masterson's company decided it needed to better understand the files it was storing before throwing any more disk into its data center. Unfortunately, Masterson said, this information isn't available in the metadata provided by Windows systems.

Data classification is end users' job
"People have dumped stuff on me like I'm a landfill, but I'm not in the storage business," he noted. He is, however, responsible for ensuring that the company's scientists can find files months or even years after they've created them -- and with a recovery rate of minutes or hours, not days. Drug discovery is a competitive field, and it's heavily regulated by the Federal Drug Administration (FDA) and the Sarbanes-Oxley Act (SOX). "The risk of not managing these files is huge," Masterson said.

Masterson uses what he calls a "folksonomic" approach to data classification. Folksonomy is Internet parlance for tagging Web content on the fly to make it easily discoverable to users of that content. "People will not adapt consistently to one system ... it's human nature to be constantly reorganizing," he said, "and files are no different."

He's been piloting Abrevity Inc.'s FileData Classifier software for approximately one year and is impressed with its ability to work with legacy files and file systems, and to provide custom file classification and tagging. "It uses tags [that] users have already provided and words within the file system that they already understand," he said.
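
The mechanics of that are easy to picture. Below is a minimal Python sketch of how candidate tags might be derived from the folder and file names users have already written; the example path and the splitting rules are illustrative assumptions, not Abrevity's actual implementation.

```python
# Illustrative sketch: derive candidate tags from the words users have
# already put in folder and file names. Not Abrevity's implementation.
import re
from pathlib import Path

def candidate_tags(path):
    """Split folder names and the file name into candidate tag words."""
    parts = Path(path).parts[:-1] + (Path(path).stem,)
    words = set()
    for part in parts:
        words.update(re.split(r"[\s_\-]+", part.lower()))
    return {w for w in words if w.isalpha()}   # drop numbers and separators

# Hypothetical path for illustration:
print(candidate_tags("/projects/oncology/assay_2006/FACS_run_07.fcs"))
# e.g. {'projects', 'oncology', 'assay', 'facs', 'run'}
```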

Aside from email and the usual Microsoft Office files, fluorescence-activated cell-sorting (FACS) files -- more commonly called instrument files -- make up much of the company's unstructured data. These are text files produced by flow cytometers, instruments used to measure microscopic particles in fluids. As the instruments become smarter, they crank out more data, all of which must be stored and managed. Analysts report that these instruments accounted for more spending last year than IT storage systems, and that in most life sciences facilities they generated an order of magnitude more files than Microsoft Office or email users did.

Masterson notes that while other data classification tools (he looked at products offered by Arkivio Inc. and Kazeon Systems Inc.) are designed to extract known values from a single document and don't create indexes for multidocument searching, Abrevity's FileData Classifier can search and parse FACS headers, extract target data, tag files with new metadata for classification and then allow for policy-based management.
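
The vendor's internals aren't public, but the workflow Masterson describes can be sketched. The Python example below assumes the instrument files expose simple "KEY: value" header lines and uses made-up field names (EXPERIMENT, OPERATOR, INSTRUMENT); it walks the directory tree once, extracts those fields as tags and builds an in-memory index so later searches never have to reopen every file.

```python
# Hypothetical sketch of header-driven classification. The header format,
# field names and index layout are assumptions for illustration; this is
# not Abrevity's SLICEbase or the real FCS keyword format.
import os
from collections import defaultdict

HEADER_FIELDS = {"EXPERIMENT", "OPERATOR", "INSTRUMENT"}   # assumed fields

def read_header_tags(path, max_lines=50):
    """Pull classification tags from the first few header lines of a file."""
    tags = {}
    with open(path, errors="ignore") as f:
        for _, line in zip(range(max_lines), f):
            if ":" not in line:
                continue
            key, _, value = line.partition(":")
            if key.strip().upper() in HEADER_FIELDS:
                tags[key.strip().upper()] = value.strip()
    return tags

def build_index(root):
    """Walk nested folders once and index files by (field, value) tag."""
    index = defaultdict(list)            # (field, value) -> [paths]
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            for field, value in read_header_tags(path).items():
                index[(field, value)].append(path)
    return index

# index = build_index("/data/instruments")
# index[("EXPERIMENT", "oncology-2006")]   # matching files, in any folder
```

A real deployment would persist the index and cope with binary file segments, but the pattern of extracting metadata once and searching the index thereafter is what makes cross-folder queries fast.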

"Engineers nest folders within folders, so it's important to be able to search across these without having to open each file, which can take hours or days," he said.

More significantly, FileData Classifier offers context-based discovery rather than plain text searching, storing its index in a proprietary database technology the vendor calls SLICEbase instead of a relational database. This "speed[s] up searches tremendously," Masterson claimed. "They've got the right approach [to] preserving context."

Still, Masterson said that showing users how to tag files with a business value is an arduous task. To that end, he built a survey and created interview questions to find out which files are important given business and regulatory requirements. The secret is to keep classifications simple. "We have security and retention tags only. Don't get too complex with it and create slices that people will forget are even there," he advised. He also recommends creating a short list of the most important data -- files for a legal discovery case or human resource files, for example -- rather than trying to tag everything.
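
In code terms, a schema that simple might look like the sketch below. The tag values and file paths are hypothetical, but the idea is the same: two dimensions only, applied first to the short list of files that matter most.

```python
# Minimal sketch of a two-tag schema (security and retention only).
# The allowed values and example paths are illustrative assumptions.
ALLOWED = {
    "security":  {"public", "internal", "restricted"},
    "retention": {"1y", "7y", "legal-hold"},
}

classified = {}   # path -> {"security": ..., "retention": ...}

def tag_file(path, security, retention):
    assert security in ALLOWED["security"] and retention in ALLOWED["retention"]
    classified[path] = {"security": security, "retention": retention}

# Start with the files that matter most, not everything:
tag_file("/hr/benefits_2006.xls",         "restricted", "7y")
tag_file("/legal/discovery/case_114.doc", "restricted", "legal-hold")
```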

So far, Masterson has indexed about one-third of his office's unstructured content. His next step is to turn on policy automation to force the back end to move files to the right location.
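
A policy layer like that can be as plain as a table mapping retention tags to storage tiers. The sketch below reuses the tag map from the earlier example and uses made-up mount points standing in for the primary and nearline arrays; it is a rough illustration, not the vendor's policy engine.

```python
# Hypothetical policy step: move each classified file to the tier its
# retention tag dictates. Mount points are illustrative assumptions.
import shutil
from pathlib import Path

POLICY = {                     # retention tag -> destination root
    "legal-hold": "/mnt/primary/legal_hold",
    "7y":         "/mnt/nearline/archive",
    "1y":         "/mnt/nearline/short_term",
}

def apply_policy(classified):
    """Move every tagged file to the location its retention tag requires."""
    for path, tags in classified.items():
        dest_root = POLICY.get(tags.get("retention"))
        if dest_root is None:
            continue                              # no rule, leave in place
        dest = Path(dest_root) / Path(path).name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(path, str(dest))

# apply_policy(classified)   # using the tag map built in the previous sketch
```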

"It will [take] a while for us to achieve nirvana," said Masterson. The dream is for users to tag files with the appropriate values when they save them. Ideally, this functionality will be built into the operating system, but for now the Abrevity tool is a good start, he said.
