Data classification and identification is all about tagging your data so it can be found quickly and efficiently.
But organisations can also gain from de-duplicating their information, which helps to cut storage and backup costs, whilst speeding up data searches.
Thirdly, classification can help an organisation to meet legal and regulatory requirements for retrieving specific information within a set timeframe, and this is often the motivation behind implementing data classification technology.
However, data strategies differ greatly from one organisation to the next, as each generates different types and volumes of data. The balance between office documents, e-mail correspondence, images, video files, customer and product information, financial data, and so on may vary greatly from one user to the next.
It may seem a good idea to classify and tag everything in the databases, but experts warn against it.
Andy Whitton, partner in Deloitte's data practice, says, "Full data classification can be a very expensive activity that very few organisations do well. Certified database technologies can tag every data item; however, in our experience only governments do this because of the cost implications."
Instead, Whitton said, companies need to choose certain types of data to classify, such as account data, personal data, or commercially valuable data.
He added that the starting point for most companies is to classify data in line with their confidentiality requirements, adding more security for increasingly confidential data. "If it goes wrong, this could be the most externally damaging - and internally sensitive. For example, everyone is very protective over salary data," says Whitton.
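The idea of adding more security as data becomes more confidential can be sketched in a few lines of Python. This is only an illustration: the four confidentiality levels and the controls attached to each are assumptions made for the example, not a scheme described by Whitton or Deloitte.

```python
from enum import IntEnum
from typing import List


class Confidentiality(IntEnum):
    # Hypothetical levels, ordered from least to most confidential.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical controls introduced at each level.
CONTROLS_ADDED_AT = {
    Confidentiality.PUBLIC: [],
    Confidentiality.INTERNAL: ["staff login required"],
    Confidentiality.CONFIDENTIAL: ["encryption at rest"],
    Confidentiality.RESTRICTED: ["named-user access list", "access audit log"],
}


def required_controls(level: Confidentiality) -> List[str]:
    """Return every control that applies at or below the given level."""
    controls: List[str] = []
    for step in Confidentiality:
        if step <= level:
            controls.extend(CONTROLS_ADDED_AT[step])
    return controls


# Salary data would plausibly sit at the top of such a scheme.
salary_controls = required_controls(Confidentiality.RESTRICTED)
```

Because the controls accumulate, a more confidential class never ends up with less protection than a less confidential one.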
As well as the type and confidentiality of the data, organisations should also consider its integrity, as low-quality data cannot be trusted. Users should also consider its availability, because high data availability requires a resilient storage and networking environment.
Tagging the data in the right way, by using an effective metadata strategy, is essential, said Greg Keller, chief evangelist at software firm Embarcadero. "In other words, the egg must truly precede the chicken."
"The enterprise is overwhelmed with data, including relational (structured) and non-relational (semi-structured or non-structured), much of which is redundant, stale and of radically varying quality," he explains.
"A plan must be put in place, by an enterprise or data architecture team, to first source the desired data, standardising the path to it, documenting the data's structure and general content along with any known business rules and then ultimately communicating this initial set of information to relevant constituencies."
Once this platform of initial "metadata" has been established and replicated successfully to other information stores, the organisation can implement a "classification taxonomy" to tag the assets of varying types, in terms of their business relevance, said Keller.
"This set of tags can range from its quality classification, to its encryption/security level, to its volatility," says Keller.
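A classification taxonomy of this kind might be modelled as a small set of tags attached to each data asset. The sketch below is an assumption-laden illustration: the three tag dimensions come from Keller's examples, but the specific values, the `DataAsset` record and the sample paths are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Quality(Enum):
    LOW = "low"
    HIGH = "high"


class SecurityLevel(Enum):
    OPEN = "open"
    ENCRYPTED = "encrypted"


class Volatility(Enum):
    STATIC = "static"      # rarely changes once written
    VOLATILE = "volatile"  # updated frequently


@dataclass(frozen=True)
class ClassificationTag:
    quality: Quality
    security: SecurityLevel
    volatility: Volatility


@dataclass
class DataAsset:
    path: str  # hypothetical location of the asset
    tag: ClassificationTag


def assets_with_security(assets: List[DataAsset], level: SecurityLevel) -> List[str]:
    """List the paths of assets tagged with the given security level."""
    return [a.path for a in assets if a.tag.security is level]


catalogue = [
    DataAsset("hr/salaries.csv",
              ClassificationTag(Quality.HIGH, SecurityLevel.ENCRYPTED, Volatility.VOLATILE)),
    DataAsset("marketing/brochure.pdf",
              ClassificationTag(Quality.HIGH, SecurityLevel.OPEN, Volatility.STATIC)),
]
encrypted_paths = assets_with_security(catalogue, SecurityLevel.ENCRYPTED)
```

Keeping the tag as a single immutable record means an asset is always classified on every dimension at once, rather than being partially tagged.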
Rich Hale, product and strategy manager at Active Navigation, which develops tools to manage unstructured information, says there are several best practices to keep in mind when implementing a new data classification system.
According to Hale, it is imperative to have the support of the management and employees who will be using the system. "Simply using technology to automatically build a new scheme or designing a scheme behind closed doors will lead to a dysfunctional approach and adoption will be resisted."
The classification system itself must include an element of centralised control so that data may be classified in the context of overall strategic business objectives, such as compliance.
Secondly, before attempting to design a new classification system, it is important to check that the data sets to be classified and fed into the system are of good quality.
"A common problem with current information systems is that too much rubbish is allowed to accumulate, from duplication to copies of office party photos and personal letters to bank managers, making the task difficult, at best," says Hale.
Storage cleansing products are useful here, because they remove redundant, obsolete or trivial content.
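One simple way to spot the "redundant" part of that content is to group files by a hash of their contents. The following sketch is not how any particular cleansing product works; it assumes the file contents are already in memory, and the file names are invented for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample: two identical reports and one unrelated photo.
sample = {
    "reports/q1.doc": b"Quarterly results",
    "backup/q1_copy.doc": b"Quarterly results",
    "photos/party.jpg": b"\xff\xd8 office party",
}
duplicates = find_duplicates(sample)
```

Exact hashing only catches identical copies. Near-duplicates, and content that is obsolete or trivial rather than redundant, need other techniques.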
The third step is to carry out an information audit, to gain an accurate view of the nature of the data, including the dominant themes, semantics or the gist of the information, and not just the metadata.
The results of an audit then need to be placed in context with the existing metadata information, as well as the details of where and how the information has been stored, to give the richest possible view of the content. Audit presentation technology can help here, assisting classification designers to query, sift and filter audit results rapidly.
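As a very rough stand-in for the theme analysis an audit tool performs, the most frequent terms across a set of documents can be counted. This is a deliberately crude sketch under stated assumptions: the stop-word list and sample documents are invented, and real audit products are likely to use far richer semantic analysis than word counts.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "on", "is"}


def dominant_terms(documents: Dict[str, str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the top_n most frequent non-stop-words across all documents."""
    counts: Counter = Counter()
    for text in documents.values():
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


docs = {
    "a.txt": "Invoice for the supply of printer paper",
    "b.txt": "Invoice reminder: payment for printer toner",
    "c.txt": "Office party photos",
}
themes = dominant_terms(docs, top_n=2)
```

Here the two invoice documents dominate the result, which hints at a finance-related theme that a classification designer could then examine alongside the existing metadata.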
The last stage of the data classification and identification strategy is the classification design stage. Hale recommends that users combine classification design tools with audit presentation, which means that the audit results can be acted on: this ensures the system is more effective.
Hale urges IT managers to look at technology that uses the audit themes and metadata to build a scheme and then test it against selected data sets, to determine how successful the resulting classification will be.
Users should then be prepared to monitor and maintain the data classification system. "Once a classification scheme is up and running, it must not be considered to be set in stone - a process for review and update, again involving users, is required to ensure that adoption grows and that it continues to meet the changing needs of the organisation," said Hale.
Andy Holpin of independent consultancy Morse agrees that continuous monitoring of the data classification system will keep it sharp.
He also argues that when identifying and classifying data, businesses should consider how the information will be physically stored and categorised over its lifetime. This is because stock lists, for example, soon go out of date, but cannot simply be deleted because of compliance and archiving requirements.
"Businesses should, therefore, ensure they are continually reviewing their data to keep it in the correct storage tier, moving it from expensive high performance storage to cheaper offline storage archiving over time," said Holpin.
Where the information is stored will come down to a number of key factors, said Hale, including information cost, the required retention period of data, whether information is business or personal data, and whether it is business-critical.
"As part of a rigorous data classification and identification strategy, tiered storage is a vital backbone to the system," said Holpin. So, data that needs to be readily available is kept on high-performance storage, while antiquated data is automatically transferred onto lower-performance systems.
"This way the system can be kept running at peak efficiency for users while still maintaining data for any compliance requirements and keeping costs at a manageable level."
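A tiering policy of this sort can be expressed as a simple rule based on how recently data was used and whether it is business-critical. The sketch below is illustrative only: the tier names and the 90-day and 365-day thresholds are assumptions, not figures given by Holpin.

```python
from datetime import date


def choose_tier(last_accessed: date, today: date, business_critical: bool) -> str:
    """Pick a storage tier from data age and business criticality."""
    age_days = (today - last_accessed).days
    # Hypothetical thresholds: critical or recently used data stays on fast storage.
    if business_critical or age_days <= 90:
        return "high-performance"
    if age_days <= 365:
        return "lower-performance"
    # Old data is kept, not deleted, so compliance requirements can still be met.
    return "offline-archive"


today = date(2008, 9, 15)
stock_list_tier = choose_tier(date(2006, 1, 1), today, business_critical=False)
```

In practice the rule would also need to take account of retention periods and storage costs, which are among the factors the article lists for deciding where information is stored.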
In a few years, data classification and identification systems will use automated data provisioning, says Holpin, and this will place data management in the hands of users.
For example, departments wanting to deploy new applications will be able to provision their own storage by using management software to assess for themselves how dependent the business is on the application, and what level of performance and availability it requires.
"However, in order for any of these technologies to lead to success, businesses must realise that the strategy is not just about point solutions. It needs to encompass processes and procedures along with exciting-looking technology," comments Holpin.
Among the "exciting-looking" technologies that are emerging are systems that can tag data by relating it to the employees who created it.
As a result, softer "knowledge-based" assets that store employee expertise can be categorised alongside traditional data.
Simon Price, UK director at enterprise search specialist Recommind, says, "With social networking now firmly ingrained in the public's conscience, businesses are beginning to look to technologies that can bring their staff together, regardless of geographic location."
Recommind produces "expertise location" software that can help to quickly locate and access information about the person who has the most relevant knowledge-set, often because they worked on a similar project, task or job.
"This is the equivalent of a constantly and automatically updating 'profile' for each staff member," says Price.
Document management is also evolving. Xerox is working on Smarter Document Technology (SDT), which is specifically designed to analyse and handle information in images and text, whether digital, printed or handwritten.
SDT can understand and categorise text, and combine both text and images, through a method called hybrid categorisation. According to Xerox, this can be far more efficient than categorising the two separately.
Another software firm, Autonomy, classifies and identifies information using "meaning-based" computing.
The system collects indexed data and stores it in its proprietary structure, optimised for fast processing and retrieval. The information processing layer then forms a conceptual and contextual understanding of all content in an enterprise, automatically analysing over 1,000 different content formats, and even people's interests.
Finally, there is EMC Infoscape, a tool aimed at large enterprises that uses content and metadata analysis, repository management, and discovery and data movement technologies to track and classify data. It also allows users to classify data based on importance, move it to a storage tier according to predetermined policies, and manage its retention for compliance.
When it comes to data classification and identification, there is no lack of powerful software tools to assist. As has always been the case, the success of the system comes down to the implementation strategy.
Data classification - 10 top tips
1. Think twice about tagging and categorising everything - the costs are high
2. Consider the confidentiality and security of the data to be classified
3. Consider its integrity, as low-quality data cannot be trusted
4. Look at its availability - high availability needs resilient storage and networking
5. Use an effective metadata strategy to tag the data well
6. Get the support of the management and employees who will use the system
7. Use data cleansing technology to remove redundant, obsolete or trivial content
8. Carry out an information audit, to gain an accurate view of the nature of the data
9. Carry out classification design based on the data audit results
10. Monitor and maintain the data classification system over time, tweaking as necessary
This was first published in September 2008