Have you heard of a zettabyte?
Whether you have or not, by 2010 we are going to be creating nearly
one zettabyte of data every year, according to analyst group IDC.
This is the equivalent of three million times all the information
in all the books ever written, or the same as a stack of books
reaching as far as the sun.
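To put the figure in perspective, the arithmetic behind it can be sketched in a few lines of Python. This is only a rough illustration: it assumes the decimal (SI) definition of a zettabyte, 10^21 bytes, and takes the "three million times" comparison at face value. The function names are invented for the example.

```python
# Decimal (SI) units: 1 zettabyte = 10**21 bytes, 1 terabyte = 10**12 bytes.
ZETTABYTE = 10**21
TERABYTE = 10**12

# The comparison quoted above: one zettabyte is roughly three million times
# all the information in all the books ever written.
BOOKS_MULTIPLE = 3_000_000


def implied_size_of_all_books_tb() -> float:
    """Size the comparison implies for 'all books ever written', in terabytes."""
    return ZETTABYTE / BOOKS_MULTIPLE / TERABYTE


def drives_needed(drive_capacity_tb: float) -> float:
    """Number of drives of the given capacity needed to hold one zettabyte."""
    return ZETTABYTE / (drive_capacity_tb * TERABYTE)


if __name__ == "__main__":
    print(f"All books ever written: about {implied_size_of_all_books_tb():.0f} TB")
    print(f"One-terabyte drives per zettabyte: {drives_needed(1.0):,.0f}")
```

On these assumptions, a single zettabyte would fill a billion one-terabyte drives.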
This is an alarming prospect and illustrates why every IT professional
dealing with storage is concerned over how to store growing
amounts of data.
Now, if all this data were easily stored and recovered, and able
to make a contribution to the business, there would be little to
worry about.
But as much as 80% of this data is unstructured, which usually means the
business has no idea what is there. And because end-users do not
know what they have, it is not put to any use - the data is
effectively worthless.
So, where is all this data coming from? To begin with, there are
the kinds of documents many of us create every day. According to
analyst group Forrester there are 300 million Excel installations
worldwide, 200 million PDF documents on the web, and 100 million
Microsoft Office documents created every day.
Besides such general office files there are also vast quantities
of line-of-business data formats that resist being structured, such
as medical scans, mapping information, engineering drawings,
mortgage applications and new drug files.
At best, such volumes of unknown data - often duplicated many
times over - are simply a drain on storage resources. At worst they
can cost the business millions of pounds should the manner of
retention fail to meet compliance regulations.
At root it is a sheer waste of information, which costs money to
create and which could be put to good use. To begin to get to grips
with the potential problems of unstructured data you have to find
out exactly what you possess.
According to Dave Gingell, marketing vice president with
storage equipment supplier EMC, the main reason for discovery
and classification of unstructured data residing on file systems is
risk management. "Organisations want to understand what information
they are holding that could potentially lead to exposure to an
industry or government regulation and get it under control," he
says.
"An attendant and equally important driver is that of
information lifecycle management. If the data can be discovered
and classified, then the appropriate information infrastructure can
be utilised and the correct provisioning provided, based on the
value of the information asset," says Gingell.
Besides such "defensive" reasons for getting a grip on your
unstructured data, there are also benefits to be gained by being
able to utilise it to enrich structured information, says Rob
Karel, principal analyst with Forrester. "Today's data warehouses
are built upon structured information from relational databases,
enterprise applications, and flat files generated from multiple
sources.
"The largest opportunity from bridging the structured and
unstructured information divide is creating richer information for
core business applications than structured data alone can
provide.
"Packaged enterprise applications such as
customer relationship management (CRM) systems and enterprise
resource planning (ERP) systems do not realise their full potential
today because important data maintained in unstructured
repositories is just too expensive to integrate," Karel says.
He adds, "More importantly, information and knowledge management
professionals are beginning to realise that users need and value
content much more when it is accessible contextually within the
business process, rather than searching for relevant content in a
separate, disconnected content system."
It is clear then that there are numerous benefits to
discovering, classifying and being able to use unstructured data.
But how do you begin to know what is there?
There is a range of products available offering features such
as discovery, classification, search, migration and transformation
capabilities. The discovery process identifies file and data types
in your infrastructure, while classification is applied to the
discovered data, creating metadata indices for each file and file
type based on a defined set of rules.
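As a rough illustration of the two steps, the Python sketch below walks a directory tree (discovery) and then labels each file according to a small rule set (classification). It is a minimal sketch, not how any of the commercial tools work: the `FileRecord` structure, the `RULES` table and the label names are all assumptions made for the example, and real products inspect far more than file names and sizes.

```python
import mimetypes
import os
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata gathered for one discovered file (hypothetical structure)."""
    path: str
    size_bytes: int
    modified: float
    mime_type: str
    labels: list = field(default_factory=list)


# Hypothetical classification rules: label -> test applied to a FileRecord.
RULES = {
    "spreadsheet": lambda r: r.path.lower().endswith((".xls", ".xlsx", ".csv")),
    "pdf": lambda r: r.mime_type == "application/pdf",
    "large": lambda r: r.size_bytes > 10 * 1024 * 1024,
}


def discover(root: str) -> list:
    """Discovery: walk the tree under root and record basic facts per file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            mime, _encoding = mimetypes.guess_type(full_path)
            records.append(
                FileRecord(full_path, info.st_size, info.st_mtime, mime or "unknown")
            )
    return records


def classify(records: list, rules: dict = RULES) -> list:
    """Classification: attach every label whose rule matches the record."""
    for record in records:
        record.labels = [label for label, test in rules.items() if test(record)]
    return records


if __name__ == "__main__":
    for record in classify(discover(".")):
        print(record.path, record.mime_type, record.labels)
```

The resulting list of records is, in effect, a very small metadata index. Keeping the rules in a table separate from the walking code reflects the point that classification depends on a defined, and changeable, set of rules.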
Search capabilities are the natural follow-on to classification,
as is the use of metadata to locate files based on criteria beyond
simple file names or creation dates. Search capability is
particularly important for archival or compliance purposes to aid
quick retrieval.
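A metadata search of this kind might look something like the following Python sketch. The index entries, field names (`department`, `doc_type`, `retain_until`) and paths are invented purely for illustration; a real product would maintain its own index and query language.

```python
from datetime import date

# Hypothetical metadata index: one dictionary per classified file.
INDEX = [
    {"path": "/shares/finance/q3-forecast.xlsx", "department": "finance",
     "doc_type": "spreadsheet", "retain_until": date(2015, 1, 1)},
    {"path": "/shares/finance/audit-letter.pdf", "department": "finance",
     "doc_type": "pdf", "retain_until": date(2012, 6, 30)},
    {"path": "/shares/hr/handbook.pdf", "department": "hr",
     "doc_type": "pdf", "retain_until": date(2010, 12, 31)},
]


def search(index: list, **criteria) -> list:
    """Return the entries whose metadata matches every given criterion."""
    return [
        entry for entry in index
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


def retained_beyond(index: list, cutoff: date) -> list:
    """Return the entries that must be kept past the cutoff date."""
    return [entry for entry in index if entry["retain_until"] > cutoff]


if __name__ == "__main__":
    for hit in search(INDEX, department="finance", doc_type="pdf"):
        print(hit["path"])
```

Neither query looks at a file name or a creation date: the criteria are business attributes recorded at classification time.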
Data classification can also be coupled to an information
lifecycle management strategy, with the movement of data across the
storage infrastructure based on rules referenced in its metadata
(an expiry date, for example).
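The idea can be sketched in Python as a rule that reads an expiry date from each file's metadata and decides which storage tier the file belongs on. The tier names, the metadata fields and the rule itself (expired files move to an archive tier) are assumptions for the sake of the example, not a description of any particular product.

```python
from datetime import date

# Hypothetical tier names; a real product would define its own.
PRIMARY = "primary"
ARCHIVE = "archive"


def target_tier(metadata: dict, today: date) -> str:
    """Pick a storage tier from a file's metadata.

    Assumed rule: once the expiry date has passed, the file belongs on the
    archive tier. Otherwise, or if no expiry date is set, it stays put.
    """
    expiry = metadata.get("expiry_date")
    if expiry is not None and expiry <= today:
        return ARCHIVE
    return metadata.get("tier", PRIMARY)


def plan_moves(index: list, today: date) -> list:
    """Return (path, from_tier, to_tier) for every file that should move."""
    moves = []
    for entry in index:
        current = entry.get("tier", PRIMARY)
        wanted = target_tier(entry, today)
        if wanted != current:
            moves.append((entry["path"], current, wanted))
    return moves


if __name__ == "__main__":
    index = [
        {"path": "/shares/old-report.pdf", "tier": PRIMARY,
         "expiry_date": date(2007, 1, 1)},
        {"path": "/shares/live-plan.xlsx", "tier": PRIMARY,
         "expiry_date": date(2009, 1, 1)},
    ]
    for move in plan_moves(index, date(2008, 1, 1)):
        print(move)
```

The sketch only produces a plan of moves. Actually relocating the data is deliberately left out, since that would fall to whatever data mover or archiving product is in use.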
Transforming unstructured information to core application and
industry formats is another key feature of some tools. In such
cases the tools are able to transform data created in PDF, Excel or
other common formats into industry-specific file types.
Major ERP players such as SAP and Oracle are also working on
ways of bridging structured and unstructured information, and
analysts expect progress in the next year or so.
A large number of software providers are working in this
area, with offerings ranging from general discovery and
classification tools through to specialised products
capable of migration and transformation.
Companies include Abrevity, Arkivio, Autonomy, Index Engines,
Kazeon, Scentric, StoredIQ and storage provider EMC.
EMC has incorporated products from acquisitions such as Smarts
and Documentum into its EMC Infoscape product family, which is used
for classifying and managing unstructured data.
Because of the technical language used by different industry
sectors, specialised lexicons often distinguish different
suppliers' products, and this is a key feature to examine during
procurement, says Greg Schultz, founder and senior analyst with
StoredIQ.
"Key features to watch out for include support for various
taxonomies or industry-specific lexicons, the ability to perform
deep or shallow discovery and classification, to be transparent to
block or file system data, and to interact with various
storage systems, including those that are encrypted or compressed,"
Schultz says.
"Tools should work with each other, such as policy managers,
data movers and archiving products. For legal and compliance
purposes you should look for litigation hold, scheduled delete,
audit trails, flexible reporting and results export capabilities.
Also important is the ingest rate or speed at which documents can
be processed," he says.
But while the tools available provide an automated means of
dealing with unstructured data, such projects are not trivial and
they work from a difficult starting point in that most
organisations "do not know what they do not know" when it comes to
their unstructured data. That is, they are often unaware of exactly
what information resides where, and of the risks posed by storing
it this way.
Initiating a project to deal with all the unstructured data in
an organisation should be a carefully crafted process. The business
drivers that are most effective in convincing senior management to
invest in such a project will be defined tightly around, for
example, a risk management or information lifecycle project.
Understanding all the unstructured information will be crucial in
either of these situations.
Schultz says, "Be clear on what it is that you are looking to
accomplish and why. For example, are you looking for shallow, basic
discovery and data classification - a step above file and data
resource management - or are you looking for the ability to search
deeper into files and documents to understand where and how
information is and has been used?
"Also be clear if you are looking for a document management
system, an archiving system, enterprise search, legal or compliance
discovery, or information and storage resource reporting."
Comment on this article:
computer.weekly@rbi.co.uk