Data classification and identification is all about tagging
your data so it can be found quickly and efficiently.
But organisations can also gain from de-duplicating their
information, which helps to cut storage and backup costs, whilst
speeding up data searches.
Thirdly, classification can help an organisation to meet legal
and regulatory requirements for retrieving specific information
within a set timeframe, and this is often the motivation behind
implementing data classification technology.
However, data strategies differ greatly from one organisation to
the next, as each generates different types and volumes of data.
The balance between office documents, e-mail correspondence, images,
video files, customer and product information, financial data, and so
on may vary greatly from one user to the next.
It may seem a good idea to classify and tag everything in the
databases, but experts warn against it.
Andy Whitton, partner in Deloitte's data practice, says, "Full
data classification can be a very expensive activity that very few
organisations do well. Certified database technologies can tag
every data item; however, in our experience, only governments do this
because of the cost implications."
Instead, Whitton said, companies need to choose certain types of
data to classify, such as account data, personal data, or
commercially valuable data.
He added that the starting point for most companies is to classify
data in line with their confidentiality requirements, adding more
security for increasingly confidential data. "If it goes wrong,
this could be the most externally damaging - and internally
sensitive. For example, everyone is very protective over salary
data," says Whitton.
As well as the type and confidentiality of the data,
organisations should also consider its integrity, as low-quality
data cannot be trusted. Users should also consider its
availability, because high data availability requires a resilient
storage and networking environment.
Tagging the data in the right way, by using an
effective metadata strategy, is essential, said Greg Keller,
chief evangelist at software firm Embarcadero. "In other words, the
egg must truly precede the chicken."
"The enterprise is overwhelmed with data, including relational
(structured) and non-relational (semi-structured or
non-structured), much of which is redundant, stale and of radically
varying quality," he explains.
"A plan must be put in place, by an enterprise or data
architecture team, to first source the desired data, standardising
the path to it, documenting the data's structure and general
content along with any known business rules and then ultimately
communicating this initial set of information to relevant
constituencies."
Once this platform of initial "metadata" has been established
and replicated successfully to other information stores, the
organisation can implement a "classification taxonomy" to tag the
assets of varying types, in terms of their business relevance, said
Keller.
"This set of tags can range from its quality classification, to
its encryption/security level, to its volatility," says Keller.
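As a rough illustration of the kind of tag set Keller describes, the
sketch below models one classification record in Python. The class
names, the levels within each tag, the example assets and the
encryption rule are all hypothetical, chosen only to show the shape of
such a taxonomy; they are not taken from any vendor's product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical quality grades for a data asset."""
    UNVERIFIED = 0
    REVIEWED = 1
    TRUSTED = 2


class SecurityLevel(IntEnum):
    """Hypothetical confidentiality levels, lowest to highest."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


class Volatility(IntEnum):
    """Hypothetical measure of how often the asset changes."""
    STATIC = 0
    PERIODIC = 1
    FREQUENT = 2


@dataclass(frozen=True)
class ClassificationTags:
    """One set of taxonomy tags attached to a single data asset."""
    asset: str
    quality: Quality
    security: SecurityLevel
    volatility: Volatility

    def needs_encryption(self) -> bool:
        # Assumed policy: anything confidential or above is encrypted.
        return self.security >= SecurityLevel.CONFIDENTIAL


# Illustrative assets only.
salary_data = ClassificationTags(
    asset="hr/salaries",
    quality=Quality.TRUSTED,
    security=SecurityLevel.RESTRICTED,
    volatility=Volatility.PERIODIC,
)
public_brochure = ClassificationTags(
    asset="marketing/brochure",
    quality=Quality.REVIEWED,
    security=SecurityLevel.PUBLIC,
    volatility=Volatility.STATIC,
)
```

Ordering the levels as integers means that a rule such as "add more
security for increasingly confidential data" can be written as a
simple comparison.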
Best practices
Rich Hale, product and strategy manager at
Active Navigation, which develops tools to manage unstructured
information, says there are several best practices to keep in mind
when implementing a new data classification system.
According to Hale, it is imperative to have the support of the
management and employees who will be using the system. "Simply
using technology to automatically build a new scheme or designing a
scheme behind closed doors will lead to a dysfunctional approach
and adoption will be resisted."
The classification system itself must include an element of
centralised control so that data may be classified in the context
of overall strategic business objectives, such as compliance.
Secondly, before attempting to design a new classification
system, it is important to check that the data sets to be
classified and fed into the system are of good quality.
"A common problem with current information systems is that too
much rubbish is allowed to accumulate, from duplication to copies
of office party photos and personal letters to bank managers,
making the task difficult, at best," says Hale.
Storage cleansing products are useful here, because they remove
redundant, obsolete or trivial content.
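One simple way such a tool might spot redundant content is to compare
content hashes. The Python sketch below is only an illustration of
that idea, not a description of any particular cleansing product; the
function names and the sample files are made up.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the content."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        by_digest[content_digest(data)].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical in-memory sample standing in for a file share.
sample = {
    "report_v1.doc": b"Quarterly results",
    "report_copy.doc": b"Quarterly results",
    "party_photo.jpg": b"\x89fake image bytes",
}
duplicates = find_duplicates(sample)
```

Hashing only catches exact copies. Obsolete or trivial content, such
as the office party photos Hale mentions, would need other rules, for
example on file type or age.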
The third step is to carry out an information audit, to gain an
accurate view of the nature of the data, including the dominant
themes, semantics or the gist of the information, and not just the
metadata.
The results of an audit then need to be placed in context with
the existing metadata information, as well as the details of where
and how the information has been stored, to give the richest
possible view of the content. Audit presentation technology can
help here, assisting classification designers to query, sift and
filter audit results rapidly.
The last stage of the data classification and identification
strategy is the classification design stage. Hale recommends that
users combine classification design tools with audit presentation,
which means that the audit results can be acted on: this ensures
the system is more effective.
Hale urges IT managers to look at technology that uses the audit
themes and metadata to build a scheme, and then test that scheme
against selected data sets to determine how successful the
resulting classification will be.
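The idea of testing a scheme against selected data sets can be
sketched in a few lines of Python. This is a deliberately naive
keyword matcher, assumed here purely for illustration; the scheme,
the sample documents and their labels are invented, and real tools
would use far richer analysis.

```python
def classify(text: str, scheme: dict[str, set[str]]) -> str:
    """Return the category whose keywords match the text most often."""
    words = set(text.lower().split())
    scores = {category: len(words & keywords)
              for category, keywords in scheme.items()}
    best = max(scores, key=lambda category: scores[category])
    return best if scores[best] > 0 else "unclassified"


def accuracy(samples: list[tuple[str, str]],
             scheme: dict[str, set[str]]) -> float:
    """Fraction of labelled samples the scheme classifies correctly."""
    correct = sum(1 for text, expected in samples
                  if classify(text, scheme) == expected)
    return correct / len(samples)


# Hypothetical scheme built from audit themes.
scheme = {
    "finance": {"invoice", "payment", "budget"},
    "hr": {"salary", "holiday", "contract"},
}
# Hypothetical hand-labelled test set.
samples = [
    ("Invoice for March payment", "finance"),
    ("Holiday request and salary review", "hr"),
    ("Office party photos", "unclassified"),
    ("Contract renewal budget and payment", "hr"),
]
score = accuracy(samples, scheme)
```

Here the last sample is misfiled under finance, so the scheme scores
three out of four. A result like that tells the designer the scheme
needs refining before it is rolled out.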
Users should then be prepared to monitor and maintain the data
classification system. "Once a classification scheme is up and
running, it must not be considered to be set in stone - a process
for review and update, again involving users, is required to ensure
that adoption grows and that it continues to meet the changing
needs of the organisation," said Hale.
Tiered storage
Andy Holpin of independent consultancy Morse agrees that
continuous monitoring of the data classification system will keep
it sharp.
He also argues that when identifying and classifying data,
businesses should consider how the information will be physically
stored and categorised over its lifetime. This is because stock
lists, for example, soon go out of date, but cannot simply be
deleted because of compliance and archiving requirements.
"Businesses should, therefore, ensure they are continually
reviewing their data to keep it in the correct storage tier, moving
it from expensive high performance storage to cheaper offline
storage archiving over time," said Holpin.
Where the information is stored will come down to a number of
key factors, said Hale, including information cost, the required
retention period of data, whether information is business or
personal data, and whether it is business-critical.
"As part of a rigorous data classification and identification
strategy, tiered storage is a vital backbone to the system," said
Holpin. So, data that needs to be readily available is kept on
high-performance storage, while antiquated data is automatically
transferred onto lower-performance systems.
"This way the system can be kept running at peak efficiency for
users while still maintaining data for any compliance requirements
and keeping costs at a manageable level."
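A tiering policy of this kind could be expressed as a simple rule on
when data was last used and how long it must be kept. The Python
sketch below assumes a 90-day threshold, three tier names and a
per-record retention date; all of these are hypothetical values for
illustration rather than anything Holpin specifies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    name: str
    last_accessed: date
    retain_until: date  # assumed compliance retention date


def choose_tier(record: Record, today: date, hot_days: int = 90) -> str:
    """Pick a storage tier from last access and retention dates."""
    if (today - record.last_accessed).days <= hot_days:
        # Recently used data stays on fast storage.
        return "high-performance"
    if today <= record.retain_until:
        # Stale, but still needed for compliance: move to cheaper storage.
        return "archive"
    # Retention period has passed: flag for a human decision.
    return "review-for-disposal"


# Illustrative dates only.
today = date(2024, 6, 1)
current_stock = Record("stock_list_current", date(2024, 5, 20), date(2031, 5, 20))
old_stock = Record("stock_list_old", date(2022, 2, 1), date(2029, 1, 31))
expired = Record("stock_list_expired", date(2015, 1, 1), date(2022, 1, 1))
```

Running the rule on a schedule is what keeps each record in the
correct tier as it ages.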
In a few years, data classification and identification systems
will use automated data provisioning, says Holpin, and this will
place data management in the hands of users.
For example, departments wanting to deploy new applications will
be able to provision their own storage by using management software
to assess for themselves how dependent the business is on the
application, and what level of performance and availability it
requires.
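What such a self-service assessment might look like can be sketched
as follows. The three-level scale, the tier names and the replication
rule are assumptions made for this example, not features of any
existing provisioning product.

```python
from dataclasses import dataclass

LEVELS = ("low", "medium", "high")
TIERS = ("capacity", "standard", "high-performance")


@dataclass(frozen=True)
class StorageRequest:
    """A department's own assessment of a new application."""
    application: str
    business_dependency: str
    performance: str
    availability: str


def recommend_storage(request: StorageRequest) -> dict[str, object]:
    """Map the assessment to a storage tier and a replication flag."""
    for value in (request.business_dependency,
                  request.performance,
                  request.availability):
        if value not in LEVELS:
            raise ValueError(f"unknown level: {value!r}")
    # Assumed rule: the performance level selects the tier directly.
    tier = TIERS[LEVELS.index(request.performance)]
    # Assumed rule: replicate when dependency or availability is high.
    replicated = "high" in (request.business_dependency, request.availability)
    return {"tier": tier, "replicated": replicated}


payroll = recommend_storage(StorageRequest("payroll", "high", "medium", "high"))
team_wiki = recommend_storage(StorageRequest("team wiki", "low", "low", "low"))
```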
"However, in order for any of these technologies to lead to
success, businesses must realise that the strategy is not just
about point solutions. It needs to encompass processes and
procedures along with exciting-looking technology," comments
Holpin.
New technologies
Among the "exciting-looking" technologies that are emerging are
systems that can tag data by relating it to the employees who
created it.
As a result, traditional data can be categorised as well as
softer "knowledge-based" assets that store employee expertise.
Simon Price, UK director at enterprise search specialist
Recommind, says, "With social networking now firmly ingrained in
the public's conscience, businesses are beginning to look to
technologies that can bring their staff together, regardless of
geographic location."
Recommind produces "expertise location" software that can help
to quickly locate and access information about the person who has
the most relevant knowledge-set, often because they worked on a
similar project, task or job.
"This is the equivalent of a constantly and automatically
updating 'profile' for each staff member," says Price.
Document management is also evolving. Xerox is working on
Smarter Document Technology (SDT), which is specifically
designed to analyse and handle information in images and text,
whether digital, printed or handwritten.
SDT can understand and categorise text, and combine both text
and images, through a method called hybrid categorisation.
According to Xerox, this can be far more efficient than
categorising the two separately.
Another software firm, Autonomy, classifies and identifies
information using "meaning-based" computing.
The system collects indexed data and stores it in its
proprietary structure, optimised for fast processing and retrieval.
The information processing layer then forms a conceptual and
contextual understanding of all content in an enterprise,
automatically analysing over 1,000 different content formats, and
even people's interests.
Finally, there are tools such as EMC Infoscape. Aimed at large
enterprises, Infoscape uses content and metadata analysis,
repository management, and discovery and data movement
technologies to track and classify data. It also allows users to
classify data based on importance, move it to a storage tier
according to predetermined policies, and manage its retention for
compliance.
When it comes to data classification and identification, there
is no lack of powerful software tools to assist. As has always been
the case, the success of the system comes down to the
implementation strategy.
Data classification - 10 top tips
1. Think twice about tagging and categorising everything - the
costs are high
2. Consider the confidentiality and security of the data to be
classified
3. Consider its integrity, as low-quality data cannot be
trusted
4. Look at its availability - high availability needs resilient
storage and networking
5. Use an effective metadata strategy to tag the data well
6. Get the support of the management and employees who will use
the system
7. Use data cleansing technology to remove redundant, obsolete or
trivial content
8. Carry out an information audit, to gain an accurate view of
the nature of the data
9. Carry out classification design based on the data audit
results
10. Monitor and maintain the data classification system over
time, tweaking as necessary