
Data classification: why it is important and how to do it

Organisations have a lot to gain from data classification and identification, but there are a few boxes to tick to make sure it’s being done right

Data classification and identification is all about tagging your data so it can be found quickly and efficiently.

But organisations can also gain from de-duplicating their information, which helps to cut storage and backup costs, while speeding up data searches.
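De-duplication usually works by fingerprinting content and grouping identical copies. A minimal file-level sketch, assuming SHA-256 content hashes are an acceptable fingerprint (the function name is illustrative, not any vendor's API):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by SHA-256 content hash; any group
    holding more than one path contains byte-identical duplicates."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only hashes that occur more than once
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Production de-duplication tools typically work at the block rather than file level, but the principle is the same: identical fingerprints mean the bytes need storing only once.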

Classification can help an organisation to meet legal and regulatory requirements for retrieving specific information in a set timeframe, and this is often the motivation behind implementing data classification technology.

However, data strategies differ greatly from one organisation to the next, as each generates different types and volumes of data. The balance may vary greatly from one user to the next between office documents, email correspondence, images, video files, customer and product information, financial data, and so on.

It may seem a good idea to classify and tag everything in the databases, but experts warn against it.

Andy Whitton, partner in Deloitte’s data practice, says: “Full data classification can be a very expensive activity that very few organisations do well. Certified database technologies can tag every data item but, in our experience, only governments do this because of the cost implications.”

Instead, Whitton says, companies need to choose certain types of data to classify, such as account data, personal data, or commercially valuable data.

He adds that the starting point for most companies is to classify data in line with their confidentiality requirements, adding more security for increasingly confidential data.

“If it goes wrong, this could be the most externally damaging – and internally sensitive. For example, everyone is very protective over salary data,” says Whitton.

As well as the type and confidentiality of the data, organisations should also consider its integrity, as low-quality data cannot be trusted. Users should also consider its availability, because high data availability requires a resilient storage and networking environment.

Tagging the data in the right way, by using an effective metadata strategy, is essential, says Greg Keller, chief evangelist at software firm Embarcadero. “In other words, the egg must truly precede the chicken,” he adds.


The enterprise is overwhelmed with data, including relational (structured) and non-relational (semi-structured or non-structured), much of which is redundant, stale and of radically varying quality, says Keller.

“A plan must be put in place, by an enterprise or data architecture team, to first source the desired data, standardising the path to it, documenting the data’s structure and general content along with any known business rules, and then ultimately communicating this initial set of information to relevant constituencies,” he says.

Once this platform of initial “metadata” has been established and replicated successfully to other information stores, the organisation can implement a “classification taxonomy” to tag the assets of varying types, in terms of their business relevance, says Keller.

“This set of tags can range from its quality classification, to its encryption/security level, to its volatility,” he says.
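A classification taxonomy of the kind Keller describes can be modelled as structured tags attached to each asset. A minimal sketch, assuming a simple in-memory catalogue; all names and tag values here are illustrative, not Embarcadero's product:

```python
from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3

@dataclass
class AssetTags:
    quality: str              # e.g. "verified" or "unreviewed"
    sensitivity: Sensitivity  # drives the encryption/security level
    volatility: str           # e.g. "static" or "frequently-updated"

# Catalogue mapping asset names to their classification tags
catalog: dict[str, AssetTags] = {}

def tag_asset(name: str, tags: AssetTags) -> None:
    catalog[name] = tags

def assets_at_least(level: Sensitivity) -> list[str]:
    """List assets whose sensitivity is at or above the given level,
    e.g. to decide which stores need encryption at rest."""
    return [n for n, t in catalog.items() if t.sensitivity.value >= level.value]
```

The point of the structure is that downstream policy (encryption, retention, tiering) can be driven mechanically from the tags rather than decided per file.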

Best practices

Rich Hale, product and strategy manager at Active Navigation, which develops tools to manage unstructured information, says there are several best practices to keep in mind when implementing a data classification system.

According to Hale, it is imperative to have the support of the management and employees who will be using the system. “Simply using technology to automatically build a new scheme or designing a scheme behind closed doors will lead to a dysfunctional approach and adoption will be resisted,” he says.

The classification system itself must include an element of centralised control so that data may be classified in the context of overall strategic business objectives, such as compliance.

Secondly, before attempting to design a new classification system, it is important to check that the data sets to be classified and fed into the system are of good quality.

“A common problem with current information systems is that too much rubbish is allowed to accumulate, from duplication to copies of office party photos and personal letters to bank managers, making the task difficult, at best,” says Hale.

Storage cleansing products are useful here, because they remove redundant, obsolete or trivial content.

The third step is to carry out an information audit, to gain an accurate view of the nature of the data, including the dominant themes, semantics or the gist of the information, and not just the metadata.

The results of an audit then need to be placed in context with the existing metadata information, as well as the details of where and how the information has been stored, to give the richest possible view of the content. Audit presentation technology can help here, assisting classification designers to query, sift and filter audit results rapidly.
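The mechanics of such an audit can be sketched simply: walk the store and summarise what is actually there. A minimal example, assuming a per-extension breakdown of file counts and sizes is a useful first cut at the "nature of the data" (real audit tools go further, into themes and semantics):

```python
from collections import Counter
from pathlib import Path

def audit_store(root: str) -> dict[str, tuple[int, int]]:
    """Walk a file store and report, per file extension, the number
    of files and the total bytes they occupy."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            ext = p.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += p.stat().st_size
    return {ext: (counts[ext], sizes[ext]) for ext in counts}
```

Even this crude report quickly surfaces where the bulk of the content sits, which file types dominate, and where cleansing effort is best spent.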

The last stage of the data classification and identification strategy is classification design. Hale recommends combining classification design tools with audit presentation, so that the audit results can be acted on directly and the resulting system is more effective.

Hale urges IT managers to look at technologies that use the audit themes and metadata to build a scheme, and then test that scheme against selected data sets to determine how successful the resulting classification will be.

Users should then be prepared to monitor and maintain the data classification system. “Once a classification scheme is up and running, it must not be considered to be set in stone – a process for review and update, again involving users, is required to ensure that adoption grows and that it continues to meet the changing needs of the organisation,” says Hale.

Tiered storage

Andy Holpin of independent consultancy Morse agrees that continuous monitoring of the data classification system will keep it sharp.

He also argues that when identifying and classifying data, businesses should consider how the information will be physically stored and categorised over its lifetime.

This is because stock lists, for example, soon go out of date, but cannot simply be deleted because of compliance and archiving requirements.

“Businesses should, therefore, ensure they are continually reviewing their data to keep it in the correct storage tier, moving it from expensive high performance storage to cheaper offline storage archiving over time,” says Holpin.

Where the information is stored will come down to a number of key factors, says Active Navigation’s Hale, including information cost, the required retention period of data, whether information is business or personal data, and whether it is business critical.

“As part of a rigorous data classification and identification strategy, tiered storage is a vital backbone to the system,” says Holpin, meaning data that needs to be readily available is kept on high-performance storage, while antiquated data is automatically transferred onto lower-performance systems.

“This way the system can be kept running at peak efficiency for users while still maintaining data for any compliance requirements and keeping costs at a manageable level.”

In a few years, data classification and identification systems will use automated data provisioning, says Holpin, and this will place data management in the hands of users.

For example, departments wanting to deploy new applications will be able to provision their own storage by using management software to assess for themselves how dependent the business is on the application, and what level of performance and availability it requires.

“However, for any of these technologies to lead to success, businesses must realise that the strategy is not just about point solutions. It needs to encompass processes and procedures along with exciting-looking technology,” says Holpin.

New technologies

Among the exciting-looking technologies that are emerging are systems that can tag data by relating it to the employees who created it.

As a result, traditional data can be categorised as well as softer knowledge-based assets that store employee expertise.

Simon Price, UK director at enterprise search specialist Recommind, says: “With social networking now firmly ingrained in the public consciousness, businesses are beginning to look to technologies that can bring their staff together, regardless of geographic location.”

Recommind produces expertise location software that can help to quickly locate and access information about the person who has the most relevant knowledge-set, often because they worked on a similar project, task or job.

“This is the equivalent of a constantly and automatically updating profile for each staff member,” says Price.

Document management is also evolving. Xerox is working on smarter document technology (SDT), which is specifically designed to analyse and handle information in images and text, whether digital, printed or handwritten.

SDT can understand and categorise text, and combine both text and images, through a method called hybrid categorisation. According to Xerox, this can be far more efficient than categorising the two separately.

Another software firm, Autonomy, classifies and identifies information using meaning-based computing.

The system collects indexed data and stores it in its proprietary structure, optimised for fast processing and retrieval. The information processing layer then forms a conceptual and contextual understanding of all content in an enterprise, automatically analysing more than 1,000 different content formats, and even people’s interests.

Finally, there are tools such as EMC Infoscape, aimed at large enterprises, which use content and metadata analysis, and repository management, as well as discovery and data movement technologies to track and classify data.

It also allows users to classify data based on importance, move it to a storage tier according to predetermined policies, and manage its retention for compliance.

When it comes to data classification and identification, there is no lack of powerful software tools to assist. As has always been the case, the success of the system comes down to the implementation strategy.

Data classification: 10 top tips

  1. Think twice about tagging and categorising everything as the costs are high.
  2. Consider the confidentiality and security of the data to be classified.
  3. Consider its integrity, as low-quality data cannot be trusted.
  4. Look at its availability – high availability needs resilient storage and networking.
  5. Use an effective metadata strategy to tag the data well.
  6. Get the support of the management and employees who will use the system.
  7. Use data cleansing technology to remove redundant, obsolete or trivial content.
  8. Carry out an information audit to gain an accurate view of the nature of the data.
  9. Carry out classification design based on the data audit results.
  10. Monitor and maintain the data classification system over time, tweaking as necessary.
