How to conduct a successful data classification project

A successful data classification project tells you what data you have, how it's classified, and that it's stored and accessed efficiently. One key is to involve the business.

By Bryan Betts, Contributor

The lifeblood of any business is information. That's where data classification -- a precursor to information lifecycle management (ILM) -- comes in. By conducting a successful data classification project, you can examine valuable information to provide business intelligence, encrypt and protect sensitive data against intrusion, match data to the appropriate storage media in a tiered configuration, and archive or delete old data as appropriate.

"Information classification is pivotal to a sustainable information governance strategy," noted Jon Collins, managing director/CEO at analyst house Freeform Dynamics. "The majority of organisations acknowledge that their information classification capabilities are weak. Information cannot be adequately exploited and protected if there's no way of tracking its location, value and sensitivity to leakage."

He added, "The ability to classify information according to business criteria has multiple impact points, including dictating security, archiving, retention and destruction requirements. Without it, information cannot have a lifecycle."

Bridge the business-IT gap

An information or data classification project is difficult to complete successfully. It involves bridging the gap between technical capabilities and business requirements, and usually requires the participation of business units, legal departments, compliance officers and other non-IT people.

"Most companies don't really understand data classification or see it as too expensive," said Clive Longbottom, service director for business processes facilitation at Quocirca. "It raises the questions that if there are 2,000 employees in a company all creating content, how do you capture and classify it all? What's meant by classification? What takes precedence and, once we've captured it, what do we do with it?"

It's no surprise that outside of niches such as compliance and e-discovery, many data classification projects get "stuck" or even abandoned. But a whole raft of vendors now tout software to do data classification for you -- even though data classification is fundamentally a business task, one of understanding the data's relative importance.

"Many of them are point solutions, looking at specific problems such as the UK Data Protection Act or US Food and Drug Administration approval," said Longbottom. "Others suffer from only dealing with specific subsets of information."

Longbottom recommends that all incoming data be captured before it hits disk and classified on the fly. By doing this, you can direct data to the right storage environment from the start, create a searchable index of content and optimise data storage at the same time.

Such a solution plays to Quocirca's Compliance Oriented Architecture (COA) model, where, Longbottom said, "if all data assets are classified from the start, any new regulations, standards or internal audit requirements can be easily catered for without implementing yet another point solution."

Keep the data classification system simple

But such a solution doesn't address the mass of unclassified data already held by organisations. The key in that case, according to users who have made it work, is to keep the classification system as simple as possible. Per Ronnow Staffe, IT infrastructure designer at Danish pharmaceutical developer Lundbeck, said that for his company this simply meant the automated archiving of disused files.

"We had restrictions on the time to finish our ILM project and realised that classifying data by business value would take too long," he explained. "We would also have had problems finding all the data owners and asking them to categorise their data.

"So we chose not to categorise data except by when it was last accessed and modified -- we archive files not accessed in the last 18 months and move them to cheaper SATA storage," he said.

Lundbeck automated this process via a policy-driven file-moving appliance from Brocade called File Management Engine (FME). Staffe said that in addition to saving space on primary storage, FME has cut the time needed to run regular backups by 30% because there's less data going to tape.
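
To make the age-based rule concrete, here is a minimal Python sketch of how archive candidates could be selected by last-access and last-modified time. It is not a description of how FME works internally, which this article does not cover. The function names, the approximation of 18 months as 18 x 30 days, and the choice to use the more recent of the two timestamps are all assumptions made for illustration.

```python
import os
import time
from pathlib import Path

# Assumption: "18 months" is approximated as 18 * 30 days.
ARCHIVE_AFTER_SECONDS = 18 * 30 * 24 * 60 * 60


def is_stale(atime, mtime, now, max_age=ARCHIVE_AFTER_SECONDS):
    """Return True if the file was neither accessed nor modified within max_age seconds."""
    return (now - max(atime, mtime)) > max_age


def find_archive_candidates(root, now=None):
    """Walk root and return the paths of files that count as stale (hypothetical helper)."""
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_stale(st.st_atime, st.st_mtime, now):
                candidates.append(path)
    return candidates
```

A sketch like this only reports candidates; moving them to a cheaper tier is left out. Access times are only useful where the file system records them, so a volume with access-time updates disabled would need a rule based on modification time alone.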

"I have been asked to calculate how much we have saved per GB. I've not done it yet, but I think it will be 50% or so. Fibre Channel storage is very expensive, but also when you archive the things that don't change, you need less backup and less administration," he said.

As a result of keeping it simple, Staffe said, "our ILM project is a huge success. This way we have had no complaints; nobody even realised we'd archived their data."

Some classification best applied "coarsely"

But there's still more that could be done. Staffe said Lundbeck has only classified the unstructured data on its file servers, not its databases or other structured data. "We aren't done yet with archiving because we also need to simplify our file structure for FME," he said.

And, Staffe noted, the business still wants a "value" category so it can delete some data altogether. "In my opinion, that's a huge project because you have to ask all the data owners to say what they don't need any more," he said, pointing out that they have little or no incentive to discard data because storage utilisation isn't something they're measured on.

"I can't imagine our scientists letting data go. They'd just move important data to USB drives," he said.

Jay Heiser, research vice president at Gartner, said that keeping it simple and not having too many classes or levels is key to the success of a data classification project. He pointed out that in a multilevel security system, for example, each asset has a label. If you have too many levels, or if your scheme is excessively fine-grained, you end up with more metadata than you have data.

"The original concept is a good one," he said. "You could, for example, set five profiles or bands and have an appropriate technical and procedural model for each one." Heiser said some classification is best applied relatively coarsely, perhaps at the server, application, business unit or device level. For instance, rather than quizzing every laptop user to find out if they have data that requires encrypting, you simply assume that all laptops (being mobile devices) need encryption.

Identify relevant areas, define classification names

So what are the key steps in conducting a successful data classification project? Quocirca's Longbottom said it comes back to understanding the business and the environment (including legal and regulatory) it operates in.

He recommends first sitting down and identifying all of the areas -- legal, business value and risk -- where classification might be relevant.

"Then build a matrix of the various information assets and the rules that apply to each," he said. "Define what the primary, secondary and tertiary storage needs of each asset will be. Next, define the rules and different classification names that will apply to the assets. Finally, get technology that will enable the rules to be applied and followed, and that will allow policy to be changed to reflect what's needed out there in the real world."

But Freeform Dynamics' Collins warns that a data classification project won't be successful unless those policies, capabilities and knowledge are fully driven throughout the organisation.

"A direct correlation exists between appreciation of the need to control the flow of information, and the communication of this between senior management and the general workforce, as well as clear policies around information management and the volume of information governance breaches that are captured," he said.
