Archiving unstructured data

Companies must find ways to automate and simplify the process of archiving files and e-mail messages. ECM software addresses this large pool of unstructured data.

What you will learn from this tip: Just archiving files and e-mail may not be enough to satisfy auditors; finding related information based on a variety of criteria may also be required.

The problem of indexing and archiving an organization's unstructured data is often swept under the rug. A typical response is to throw more hardware at the problem, but just adding more capacity to house data while ignoring its content no longer suffices. Regulators and legal professionals increasingly need to search and scrutinize unstructured data such as e-mail and file repositories, so companies must find ways to automate and simplify the process of identifying and inspecting archived files and e-mail messages. Add-ons to core enterprise content management (ECM) software, as well as specialized e-mail and file archiving programs, address this large pool of unstructured data, but they differ in how they process, discover, index and archive it. There's no complete solution available, so you'll likely have to make tradeoffs.

At a minimum, unstructured data archiving products must handle large volumes of data and meet compliance requirements in a cost-effective manner. Products from CommVault Systems Inc. and Zantaz Inc. minimize the time it takes to find particular e-mails in the archive pool. But these products generally lack the ability to build meaningful relationships among file contents, provide in-depth content analysis or create workflow processes -- all features that ECM software from companies like EMC Corp., Hummingbird Ltd. and Open Text Corp. provides.

Archiving software is generally easier to implement and better tailored for the high volume, low-cost nature of some unstructured data environments, while ECM applications offer more options to manage, classify and create relationships among data components. To decide which approach best suits your needs, you should understand how these programs manage unstructured data. Issues that should be considered include:

  • How is data discovery, indexing and archiving handled?
  • What type of meta data does the product create?
  • What type of content analysis is done by the product?
  • What default policies or categories are included?
  • Is e-mail and file meta data indexed in the same database?
  • How difficult is the application to install and maintain?
  • Are additional products or modules required to deliver the desired level of functionality?

Discovery and indexing

ZipLip Inc.'s Unified Email Archival Suite, an e-mail and file archiving product, accomplishes the discovery of e-mail by tapping the native journaling features in Exchange and Lotus Domino. By using the applications' journals, ZipLip's product can capture the information without using its own agent, as well as intercept outgoing or incoming e-mail without the sender's or recipient's knowledge. ZipLip built the product's server component to run on a grid architecture because analysis and searching can be CPU and memory intensive; the architecture also provides a scalable, low-cost way to grow. Stephen Chan, ZipLip's co-founder and vice president of business development, claims this design allows ZipLip to scan and analyze incoming or outgoing e-mail and create the meta data specified by policies with little or no interruption to the e-mail process. To expedite searching, ZipLip puts the index on a file server that's separate from the database and executes queries for data across the index, not the database.
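
That separation is easy to picture in a small sketch. The index path and layout below are invented for illustration, not ZipLip's actual design; the point is that a query consults only a compact index file on the file server and never touches the database:

```python
# Minimal sketch of querying a standalone index file rather than the
# archive database. Paths and index schema are hypothetical.
import json

INDEX_PATH = "/mnt/index_server/mail_index.json"  # index lives on its own file server

def load_index(path=INDEX_PATH):
    """The index maps each term to the IDs of archived messages containing it."""
    with open(path) as f:
        return json.load(f)

def search(index, *terms):
    """Intersect posting lists; the database is never touched during the query."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return set.intersection(*postings) if postings else set()

# Example: find archived messages mentioning both words.
# hits = search(load_index(), "merger", "confidential")
```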

Conversely, ECM apps are better suited to thoroughly analyzing content and creating and storing meta data than to rapidly processing large amounts of unstructured data.

Most ECM products don't use journaling features on the messaging server or integrate with Exchange or Domino; archival and retrieval tasks are executed using the Exchange API (MAPI) and the Notes API over the Domino NRPC protocol. The downside of this approach is that these tasks will only run at scheduled times. Because the ECM software API call looks like a client to the mail server, the mail server needs to allocate processing time to manage and handle the requests. If the ECM software asks for 1,000 e-mail messages, it's probably not a problem; but if it asks for copies of all of the e-mails since the last request, it will likely slow the messaging server's performance.
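
That polling pattern can be sketched generically. The code below is not any vendor's implementation; fetch_batch() is a hypothetical stand-in for the MAPI or Notes API request, and the throttling values are arbitrary:

```python
# Sketch of scheduled, client-style retrieval with throttling, so a large
# backlog doesn't monopolize the messaging server.
import time

BATCH_SIZE = 1000      # modest requests are "probably not a problem"
PAUSE_SECONDS = 5.0    # give the mail server time to serve real clients

def fetch_batch(since_id, limit):
    """Hypothetical stand-in for the API call the ECM software would make."""
    return []          # a real module would return message records here

def scheduled_archive_run(last_seen_id):
    """One scheduled pass: pull messages in bounded batches until caught up."""
    while True:
        batch = fetch_batch(since_id=last_seen_id, limit=BATCH_SIZE)
        if not batch:
            break                     # caught up until the next scheduled run
        last_seen_id = batch[-1]["id"]
        time.sleep(PAUSE_SECONDS)     # throttle between requests
    return last_seen_id
```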

Unlike its competitors, CommVault's QiNetix offers modules that either integrate with applications or make API calls. For instance, the firm's DataArchiver for Exchange typifies the tight integration one normally finds in file archiving software because it places an agent on the Exchange server that copies the server's messages. Other modules, such as CommVault's DataMigrator for Exchange and DataMigrator for Centera, make API calls instead. CommVault stores data from all sources in a central database, the common technology engine (CTE), which acts as a global catalog and index across the entire line of QiNetix products.

The meta data database

The data used to populate databases like CommVault's CTE is the meta data -- the data about the unstructured data -- that's generated during the analysis of each e-mail and file. The type of meta data will depend on the type of underlying content analysis tool used by the product and the policies currently in place.

The meta data includes common attributes such as file owner, creation date and last-modified date, as well as the sender, receiver and subject line of e-mail messages. Content analysis also occurs during this stage, as a text-mining tool analyzes the content and context of the documents. For example, Zantaz's EAS uses AltaVista's indexing engine to open, examine and summarize the content of files and e-mails.
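
For the common attributes, a few lines of standard-library Python are enough to show what gets harvested. This is a bare-bones sketch, not any product's code; real tools layer content analysis on top of these basics:

```python
# Harvest basic meta data from a file and from an e-mail message on disk.
import os
from datetime import datetime
from email import message_from_binary_file

def file_metadata(path):
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,                        # file owner
        # st_ctime is creation time on Windows, inode change time on Unix
        "created": datetime.fromtimestamp(st.st_ctime),
        "modified": datetime.fromtimestamp(st.st_mtime),
        "size_bytes": st.st_size,
    }

def email_metadata(path):
    with open(path, "rb") as f:
        msg = message_from_binary_file(f)
    return {
        "sender": msg["From"],
        "receiver": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
    }
```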

While indexing and archiving all unstructured data may satisfy regulators for now, it's a short-term fix. Auditors' demands are becoming increasingly specific when it comes to examining the relationships among e-mails and files. For example:

Relationships between documents: Users may be asked to retrieve all documents that are germane to a specific topic, accompanied by summaries that reflect both their content and the context in which the words are used. This may even mean being able to find relevant documents where a specific name or number isn't mentioned explicitly, but rather alluded to or implied. Doing so requires the taxonomies found in ECM software and advanced text-mining algorithms that employ techniques like lexical analysis, neural network-based intelligence and content scanning.
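
One simple, widely used building block for this kind of relationship finding is term-vector similarity. The toy sketch below stands in for the far richer lexical and neural techniques just mentioned; it scores documents by overall vocabulary overlap rather than by a single exact keyword:

```python
# Toy relationship finder: cosine similarity over term-frequency vectors.
import math
import re
from collections import Counter

def vectorize(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def related(query_text, corpus, threshold=0.3):
    """Return IDs of corpus documents whose vocabulary overlaps the query."""
    q = vectorize(query_text)
    return [doc_id for doc_id, text in corpus.items()
            if cosine(q, vectorize(text)) >= threshold]
```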

Policy-setting capabilities: Searching and managing a data archive requires the ability to set and change policies. Policy capabilities should allow administrators to encrypt documents, quarantine documents for supervisory review and track when, how and by whom a document was accessed. There should also be accommodation for some type of information lifecycle management mechanism that can either remove documents from the archive at the end of their regulatory life or recognize requirements to retain them for longer periods.
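
A lifecycle rule of the kind described might look like the sketch below. The field names, categories and retention periods are illustrative only, not drawn from any product:

```python
# Disposition check: purge a document when its retention period lapses,
# unless a legal hold overrides the expiry.
from datetime import date, timedelta

RETENTION = {"financial": timedelta(days=7 * 365),   # e.g., a seven-year rule
             "general":   timedelta(days=3 * 365)}

def disposition(doc, today=None):
    today = today or date.today()
    if doc.get("legal_hold"):
        return "retain"                               # hold overrides expiry
    expiry = doc["archived_on"] + RETENTION[doc["category"]]
    return "purge" if today >= expiry else "retain"

# disposition({"category": "financial",
#              "archived_on": date(1998, 1, 15),
#              "legal_hold": False})                  # -> "purge"
```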

Creating an indexing policy

Creating policies so that unstructured data can be properly indexed, archived and retrieved is no longer an onerous task. Lubor Ptacek, EMC's director of marketing, found that many companies that purchased Documentum would create a task force to identify needed policies and categories. Often, the companies would get bogged down debating how to categorize and classify their data before ever using Documentum.

To aid implementers, Documentum now comes with a set of default categories and policies. EMC recommends companies first address only a subset of their unstructured data. Policies can then be tuned over time to accommodate trends that emerge in the usage of the data or specific statutory requirements.

Many ECM and archiving products offer a data classification taxonomy in addition to a policy engine. The taxonomy provides other ways to classify data, such as by department, purpose or client, and data can be classified in multiple categories. The taxonomy makes it possible to find all documents pertinent to a specific subject without requiring multiple database queries.
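
The mechanics are straightforward to sketch. In the hypothetical example below, one document sits in several categories at once, and a single lookup answers a subject query:

```python
# Multi-category taxonomy: category -> set of document IDs.
from collections import defaultdict

taxonomy = defaultdict(set)

def classify(doc_id, *categories):
    for c in categories:
        taxonomy[c].add(doc_id)

classify("memo-114", "legal", "client:acme")
classify("q3-report.xls", "finance", "client:acme")

# One query, no multiple database lookups:
acme_docs = taxonomy["client:acme"]   # {'memo-114', 'q3-report.xls'}
```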

A taxonomy doesn't lessen the importance of policy engines. For instance, a policy can be set to index all occurrences of a word. During content analysis, the contents of an e-mail or file are evaluated based on existing policies, with the results stored and indexed in the meta data database to enable fast searches on the words or phrases defined in the policies.
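
As a minimal illustration, assuming hypothetical policy terms, the indexing step might look like this; each document is scanned once against the policies, and later searches consult only the index:

```python
# Policy-driven word indexing: record which documents contain policy terms.
import re

POLICY_TERMS = {"insider", "guarantee", "settlement"}    # hypothetical terms
policy_index = {term: set() for term in POLICY_TERMS}    # term -> doc IDs

def apply_policies(doc_id, text):
    words = set(re.findall(r"[a-z']+", text.lower()))
    for term in POLICY_TERMS & words:
        policy_index[term].add(doc_id)

apply_policies("msg-0042", "We cannot guarantee a settlement by Friday.")
# policy_index["settlement"] -> {'msg-0042'}
```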

But looking only at specific words or phrases has its limitations. For instance, the word "football" may mean one thing to Americans and something else to the rest of the world. To actually understand the use of the word "football" in the context of an e-mail or file, natural language processing (NLP) algorithms are used to approximate what humans do -- analyze and interpret words within their context. Thus, the content analysis process would recognize that the word "football" could mean "soccer" as well as American football.
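
A toy heuristic shows the direction, though real NLP engines use far richer models than this. The context cues below are invented; the point is that surrounding words shift the reading of the ambiguous term:

```python
# Crude word-sense disambiguation for "football" based on context words.
CONTEXT_CUES = {
    "soccer":            {"pitch", "fifa", "goalkeeper", "premier"},
    "american_football": {"touchdown", "quarterback", "nfl", "yards"},
}

def interpret_football(text):
    words = set(text.lower().split())
    scores = {sense: len(cues & words) for sense, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "ambiguous"

interpret_football("The quarterback threw for 300 yards")  # 'american_football'
interpret_football("The goalkeeper cleared the pitch")     # 'soccer'
```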

The ability to search for a word or phrase across all managed unstructured data sources from a single point is one clear advantage ECM products have over their archiving counterparts. Archiving software doesn't make API calls into other unstructured data repositories, as ECM apps do; Zantaz, for example, can only do this for unstructured data types it integrates with, mainly at this point by using NAS heads. Documentum creates a single repository for all types of unstructured data and provides users with a common set of services to manage them. It then converts these different data types into objects and creates meta data associated with each object. Converting this data into objects also allows relationships to be defined among different objects and retained in the meta data repository. The downside of the ECM approach is that it can take a lot of time up front to define and build these relationships, and it may require businesses to reengineer their existing processes. In some cases, the cost to set up and manage the unstructured data relationships may outstrip the gains.
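
The object model can be sketched in miniature. The class and field names below are invented, not Documentum's; the point is that meta data and inter-object relationships live together in the repository:

```python
# Content objects carry their own meta data plus links to related objects.
from dataclasses import dataclass, field

@dataclass
class ContentObject:
    object_id: str
    kind: str                                    # "email", "file", ...
    metadata: dict = field(default_factory=dict)
    related: set = field(default_factory=set)    # IDs of linked objects

def link(a: ContentObject, b: ContentObject):
    """Record a two-way relationship in the meta data repository."""
    a.related.add(b.object_id)
    b.related.add(a.object_id)

memo = ContentObject("memo-114", "file", {"subject": "Acme merger"})
mail = ContentObject("msg-0042", "email", {"subject": "RE: Acme merger"})
link(memo, mail)
```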

Implementation considerations

Licensing varies widely among these products. For example, the ability to discover and index messages on e-mail servers requires a user to license Open Text's Livelink for E-mail Monitoring module, while another license for the company's E-mail Archiving component is required to manage and archive e-mail. Open Text's E-mail Management module license delivers all of this functionality in a single package.

Apps like Open Text's Livelink and Veritas Software Corp.'s Enterprise Vault communicate with e-mail servers through TCP/IP ports using standard APIs: MAPI for Exchange and the Lotus APIs for Lotus Notes. However, IBM Corp.'s DB2 CommonStore uses the Notes-specific protocols Notes RPC and Domino Internet Inter-ORB Protocol (DIIOP) to extract Notes database information, while using WebDAV, a set of HTTP extensions, to access public folders on Exchange.
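
PROPFIND, the standard WebDAV method for reading resource properties, gives a flavor of that Exchange public-folder access. The host, folder and credentials in this sketch are placeholders:

```python
# WebDAV PROPFIND against a (hypothetical) Exchange public folder.
import base64
import http.client

HOST = "exchange.example.com"           # placeholder server
FOLDER = "/public/Compliance/"          # placeholder public folder
AUTH = base64.b64encode(b"user:password").decode()

body = ('<?xml version="1.0"?>'
        '<propfind xmlns="DAV:"><prop>'
        '<displayname/><getlastmodified/>'
        '</prop></propfind>')

conn = http.client.HTTPConnection(HOST)
conn.request("PROPFIND", FOLDER, body, headers={
    "Authorization": "Basic " + AUTH,
    "Depth": "1",                       # list the folder's immediate children
    "Content-Type": 'text/xml; charset="utf-8"',
})
print(conn.getresponse().status)        # 207 Multi-Status on success
```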

Because these products use common network protocols to communicate with e-mail and file servers, administrators should check their internal network to ensure that the appropriate IP ports on the firewall are open or routes exist within the routing tables.
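
A quick script along these lines can verify reachability before rollout. The host names are placeholders; the ports shown are the well-known defaults for the MAPI RPC endpoint mapper and Notes RPC:

```python
# Check that the archiving host can reach the mail servers on the needed ports.
import socket

CHECKS = [("exchange.example.com", 135),    # MAPI RPC endpoint mapper
          ("domino.example.com", 1352)]     # Notes RPC (NRPC)

for host, port in CHECKS:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(host, port, "reachable")
    except OSError as err:
        print(host, port, "unreachable:", err)
```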

Unstructured data management tools provide a means of managing e-mail and file archives. Current products, however, lack a simple, comprehensive way to deliver an enterprise-level archiving solution. For now, organizations will have to rely on archiving point solutions. Companies ready to tackle longer term compliance issues and bring some data mining functionality to their unstructured data should look toward ECM software.
