Have you heard of a zettabyte? Whether you have or not, by 2010 we are going to be creating nearly one zettabyte of data every year, according to analyst group IDC. This is the equivalent of three million times all the information in all the books ever written, or the same as a stack of books reaching as far as the sun.
This is an alarming prospect and illustrates why every IT professional dealing with storage is concerned about how to store growing amounts of data.
Now, if all this data were easily stored and recovered, and able to make a contribution to the business, there would be little to worry about.
But as much as 80% of this data is unstructured, which usually means the business has no idea what is there. And because end-users do not know what they have, it is not put to any use - the data is effectively worthless.
So, where is all this data coming from? To begin with, there are the kinds of documents many of us create every day. According to analyst group Forrester there are 300 million Excel installations worldwide, 200 million PDF documents on the web, and 100 million Microsoft Office documents created every day.
Besides such general office files there are also vast quantities of line-of-business data formats that resist being structured, such as medical scans, mapping information, engineering drawings, mortgage applications and new drug files.
At best, such volumes of unknown data - often duplicated many times over - are simply a drain on storage resources. At worst they can cost the business millions of pounds should the manner of retention fail to meet compliance regulations.
At root it is a sheer waste of information, which costs money to create and which could be put to good use. To begin to get to grips with the potential problems of unstructured data you have to find out exactly what you possess.
According to Dave Gingell, marketing vice president with storage equipment supplier EMC, the main reason for discovery and classification of unstructured data residing on file systems is risk management. "Organisations want to understand what information they are holding that could potentially lead to exposure to an industry or government regulation and get it under control," he says.
"An attendant and equally important driver is that of information lifecycle management. If the data can be discovered and classified, then the appropriate information infrastructure can be utilised and the correct provisioning provided, based on the value of the information asset," says Gingell.
Besides such "defensive" reasons for getting a grip on your unstructured data, there are also benefits to be gained by being able to utilise it to enrich structured information, says Rob Karel, principal analyst with Forrester. "Today's data warehouses are built upon structured information from relational databases, enterprise applications, and flat files generated from multiple sources.
"The largest opportunity from bridging the structured and unstructured information divide is creating richer information for core business applications than structured data alone can provide.
"Packaged enterprise applications such as customer relationship management (CRM) systems and enterprise resource planning (ERP) systems do not realise their full potential today because important data maintained in unstructured repositories is just too expensive to integrate," Karel says.
He adds, "More importantly, information and knowledge management professionals are beginning to realise that users need and value content much more when it is accessible contextually within the business process, rather than searching for relevant content in a separate, disconnected content system."
It is clear then that there are numerous benefits to discovering, classifying and being able to use unstructured data. But how do you begin to know what is there?
A range of products is available, offering features such as discovery, classification, search, migration and transformation. The discovery process identifies file and data types in your infrastructure, while classification is applied to the discovered data, creating metadata indices for each file and file type based on a defined set of rules.
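As a rough illustration of how discovery and rule-based classification fit together, the sketch below walks a directory tree, then builds a metadata record for each file it finds. It is a minimal example and does not reflect how any supplier's product works. The rule table, category names and record fields are all assumptions made for illustration, and real tools typically inspect file content rather than relying on extensions.

```python
import os
from collections import Counter

# Hypothetical rule set (an assumption for this sketch):
# file extension -> classification label.
RULES = {
    ".xls": "spreadsheet",
    ".xlsx": "spreadsheet",
    ".pdf": "document",
    ".doc": "document",
    ".dcm": "medical-scan",
}


def discover(root):
    """Discovery: yield the path of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def classify(path, rules=RULES):
    """Classification: build one metadata record from the rules and file stats."""
    extension = os.path.splitext(path)[1].lower()
    info = os.stat(path)
    return {
        "path": path,
        "extension": extension,
        "category": rules.get(extension, "unclassified"),
        "size_bytes": info.st_size,
        "modified": info.st_mtime,  # seconds since the epoch
    }


def build_index(root, rules=RULES):
    """Return a metadata index: one record per discovered file."""
    return [classify(path, rules) for path in discover(root)]


def summarise(index):
    """Count how many files fall into each category."""
    return Counter(record["category"] for record in index)
```

Calling build_index on a file share and passing the result to summarise gives a first picture of what is there, including how much falls into the "unclassified" bucket and so needs further rules.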
Search capabilities are the natural follow-on to classification, as is the use of metadata to locate files based on criteria beyond simple file names or creation dates. Search capability is particularly important for archival or compliance purposes to aid quick retrieval.
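To make the idea of searching on metadata rather than file names concrete, here is a small sketch. The records, field names and the search function are hypothetical, invented for this example. A commercial product would query a persistent index rather than a list held in memory.

```python
from datetime import date

# Hypothetical metadata records, as a classification step might produce them.
INDEX = [
    {"path": "/finance/q1.xlsx", "category": "spreadsheet",
     "owner": "finance", "created": date(2007, 3, 31)},
    {"path": "/legal/contract.pdf", "category": "document",
     "owner": "legal", "created": date(2005, 6, 1)},
    {"path": "/finance/forecast.xlsx", "category": "spreadsheet",
     "owner": "finance", "created": date(2004, 1, 15)},
]


def _matches(value, wanted):
    """A criterion is either a plain value (equality) or a predicate function."""
    if callable(wanted):
        return value is not None and wanted(value)
    return value == wanted


def search(index, **criteria):
    """Return the records whose metadata satisfies every criterion given."""
    return [
        record for record in index
        if all(_matches(record.get(field), wanted)
               for field, wanted in criteria.items())
    ]
```

For example, search(INDEX, owner="finance", created=lambda d: d < date(2006, 1, 1)) returns only the 2004 forecast spreadsheet, the kind of query a compliance or archive retrieval request might need.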
Data classification can also be coupled to an information lifecycle management strategy, with the movement of data across the storage infrastructure based on rules referenced in its metadata (an expiry date, for example).
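A lifecycle rule of this kind can be sketched in a few lines. The tier names, the 90-day and 365-day thresholds, and the "legal_hold", "expiry" and "last_accessed" fields are all assumptions chosen for illustration. An actual policy would be set by the business and its compliance requirements.

```python
from datetime import date


def choose_action(record, today):
    """Decide where a file should live, based only on its metadata.

    Assumed record fields: 'last_accessed' (date, required),
    'expiry' (date or None) and 'legal_hold' (bool, optional).
    """
    # Anything under a legal hold is retained, whatever its expiry date says.
    if record.get("legal_hold"):
        return "archive"

    # Data past its expiry date is scheduled for deletion.
    expiry = record.get("expiry")
    if expiry is not None and expiry <= today:
        return "delete"

    # Otherwise, move rarely used data to cheaper storage tiers.
    idle_days = (today - record["last_accessed"]).days
    if idle_days > 365:
        return "archive"
    if idle_days > 90:
        return "nearline"
    return "primary"
```

Checking the hold before the expiry date is a deliberate choice in this sketch: it means a file that has expired but is subject to a hold is kept rather than deleted.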
Transforming unstructured information to core application and industry formats is another key feature of some tools. In such cases the tools are able to transform data created in PDF, Excel or other common formats into industry-specific file types.
Major ERP players such as SAP and Oracle are also working on ways of bridging structured and unstructured information, and analysts expect progress in the next year or so.
A large number of software providers are working in this area, with offerings that range from general discovery and classification tools through to specialised products capable of migration and transformation.
Companies include Abrevity, Arkivio, Autonomy, Index Engines, Kazeon, Scentric, StoredIQ and storage provider EMC. EMC has incorporated products from acquisitions such as Smarts and Documentum into its EMC Infoscape product family, which is used for classifying and managing unstructured data.
Because of the technical language used by different industry sectors, specialised lexicons often distinguish different suppliers' products, and this is a key feature to examine during procurement, says Greg Schultz, founder and senior analyst with StoredIQ.
"Key features to watch out for include support for various taxonomies or industry-specific lexicons, the ability to perform deep or shallow discovery and classification, be transparent to block or file system data and to be able to interact with various storage systems, including those that are encrypted or compressed," Schultz says.
"Tools should work with each other, such as policy managers, data movers and archiving products. For legal and compliance purposes you should look for litigation hold, scheduled delete, audit trails, flexible reporting and results export capabilities. Also important is the ingest rate or speed at which documents can be processed," he says.
But while the tools available provide an automated means of dealing with unstructured data, such projects are not trivial, and they work from a difficult starting point in that most organisations "do not know what they do not know" when it comes to their unstructured data. That is, they are often unaware of exactly what information resides where, and of the risks posed by storing it this way.
Initiating a project to deal with all the unstructured data in an organisation should be a carefully crafted process. The business drivers most effective in convincing senior management to invest will be tightly defined around a specific goal, for example risk management or information lifecycle management. Understanding all the unstructured information will be crucial in either of these situations.
Schultz says, "Be clear on what it is that you are looking to accomplish and why. For example, are you looking for shallow, basic discovery and data classification - a step above file and data resource management - or are you looking for the ability to search deeper into files and documents to understand where and how information is and has been used?
"Also be clear if you are looking for a document management system, an archiving system, enterprise search, legal or compliance discovery, or information and storage resource reporting."