Data archives overview

Stephen J. Bigelow, Features Writer
Under normal circumstances, a backup is simply a copy of data that is kept aside to protect against data loss. When a file is lost due to user error, or data is corrupted because of system problems, the affected data can be restored from a backup. An archive is different from a backup because the data may not be used for months, even years, but must be accessed quickly when needed. This is further complicated by data archive sizes that are growing every year, in some cases by 90% or more. There is simply no time to search through burgeoning volumes of tape or optical media to locate important files. Traditional backup platforms are poorly suited for archival data storage, and users are relying on disk storage systems for a mix of performance and reliability. Files can be archived to any disk storage system, but content-addressed storage (CAS) technology has appeared to support archiving efforts.

Understanding CAS

At the simplest level, CAS is a specialized disk storage system. Since archival data is not accessed frequently, high-performance disks are not essential. In fact, most CAS platforms employ ordinary SATA hard disks for their low cost per gigabyte, though SAS disks may be used when added performance is needed to accommodate many simultaneous users. However, CAS technology incorporates a unique feature set designed to optimize storage space and improve long-term data management.

CAS technology extends the use of metadata to define a file. While any file may include mundane date, time, name or creator metadata, CAS allows a tremendous amount of additional information to be stored along with the file. Extended metadata can be essential for indexing and searching old data well into the future. For example, a physician could use metadata to search through patient files and retrieve X-rays from patients with a specific physical condition. Metadata and index/search features are also critical for meeting e-discovery or other litigation requests. Encryption techniques are sometimes employed to secure sensitive or confidential data.
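The metadata search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular CAS product indexes its content; the `ArchivedObject` class, the `search` function and the metadata field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedObject:
    """An archived file plus the extended metadata stored alongside it."""
    object_id: str
    metadata: dict = field(default_factory=dict)


def search(archive, **criteria):
    """Return every object whose metadata matches all of the given criteria."""
    return [
        obj for obj in archive
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical records; the field names are illustrative only.
archive = [
    ArchivedObject("obj-001", {"type": "x-ray", "condition": "fracture", "year": 2005}),
    ArchivedObject("obj-002", {"type": "x-ray", "condition": "pneumonia", "year": 2006}),
    ArchivedObject("obj-003", {"type": "lab-report", "condition": "fracture", "year": 2006}),
]

# Find X-rays from patients with a specific condition.
matches = search(archive, type="x-ray", condition="fracture")
print([obj.object_id for obj in matches])
```

A production archive would use a real index rather than a linear scan, but the principle is the same: the richer the metadata captured at archive time, the more precisely old files can be located later.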

Next, CAS data cannot be changed once it is archived. This ensures data integrity and prevents tampering or spoliation. A corporate regulatory audit or litigation discovery can proceed with high confidence that the data being examined is original and unaltered. Tamper-proofing is generally accomplished by treating files as objects with unique designators and locations. Since most archival data has a finite lifecycle, CAS also manages data retention and disposal in accordance with regulatory or compliance requirements. Data reaching its retention limit is systematically deleted.
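To make the idea of "unique designators" concrete, here is a minimal sketch that assumes the designator is a SHA-256 digest of the file's content, which is a common way to build content addresses but is not confirmed by the vendors discussed here. The `WriteOnceStore` class and its retention scheme (a simple expiry year) are hypothetical simplifications.

```python
import hashlib


class WriteOnceStore:
    """Toy content-addressed store: the address is a digest of the content."""

    def __init__(self):
        self._objects = {}  # address -> (content bytes, expiry year)

    def put(self, content, expires):
        address = hashlib.sha256(content).hexdigest()
        # Storing identical content again is a no-op, and an address can never
        # be re-pointed at different bytes, so archived data stays unaltered.
        self._objects.setdefault(address, (content, expires))
        return address

    def get(self, address):
        content, _ = self._objects[address]
        # Re-hash on read: any tampering changes the digest and is detected.
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("stored content does not match its address")
        return content

    def dispose_expired(self, current_year):
        """Delete objects at or past their retention limit; return the count."""
        expired = [a for a, (_, exp) in self._objects.items() if exp <= current_year]
        for address in expired:
            del self._objects[address]
        return len(expired)


store = WriteOnceStore()
addr = store.put(b"Q3 audit report", expires=2014)
edited = store.put(b"Q3 audit report (edited)", expires=2014)

print(addr != edited)              # an edited file is a new object, not an overwrite
print(store.get(addr))             # the original is still retrievable, unchanged
print(store.dispose_expired(2014)) # both objects have reached their retention limit
```

Because the address is derived from the content, "changing" a file can only ever create a new object at a new address; the original remains intact until its retention period ends.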

One persistent problem with traditional file copies is the inevitable duplication of files. If there are 100 copies of an e-mail file attachment, all 100 copies are saved in the backup. For long-term archival storage, this kind of inefficiency can quickly exhaust available storage space. Another real strength of CAS technology is data deduplication (also known as single-instance storage or intelligent compression), which eliminates duplicated files or blocks of data. Only one instance of the data is saved, and subsequent copies are simply referenced back to that one saved copy. Consider a file-level example. If there are 100 identical attachments and each is 2 MB in size, archiving to CAS would take only about 2 MB, plus a small amount of space for the 100 references, instead of 200 MB with an ordinary disk system. Experts note that data deduplication can reduce storage demands by ratios of up to 50-to-1. Conventional compression techniques may also be employed to reduce disk space even further.
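The 100-attachment example can be reproduced with a short sketch. It assumes duplicates are detected by comparing SHA-256 digests of whole files; the `SingleInstanceStore` class and the file names are hypothetical, and real products track references in more elaborate ways.

```python
import hashlib

MB = 1024 * 1024


class SingleInstanceStore:
    """Toy file-level deduplication: identical files are stored only once."""

    def __init__(self):
        self.blobs = {}       # digest -> file content, stored once
        self.references = []  # one lightweight entry per archived copy

    def archive(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # keep only the first instance
        self.references.append((name, digest))

    def stored_bytes(self):
        return sum(len(content) for content in self.blobs.values())


attachment = bytes(2 * MB)  # a 2 MB attachment (all zero bytes, for simplicity)
store = SingleInstanceStore()
for i in range(100):
    store.archive(f"mailbox-{i}/report.pdf", attachment)

logical_mb = len(store.references) * len(attachment) // MB
physical_mb = store.stored_bytes() // MB
print(f"logical: {logical_mb} MB, physical: {physical_mb} MB")
# prints "logical: 200 MB, physical: 2 MB"
```

The 100 reference entries still occupy some space, but it is negligible next to the 198 MB of attachment data that never has to be written.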

Power consumption is an important consideration. As CAS systems scale up to hundreds of spinning disks, the power cost becomes substantial. Some archive systems are employing creative solutions to reduce power demands such as idling drives or powering idle drives down completely. Low-power drives and emerging drive technologies like "hybrid drives" can also help lower overall power demands.

CAS products

Major vendors in the CAS market include -- in no particular order -- EMC Corp., Nexsan Technologies, Sun Microsystems Inc., StorageTek, Permabit Inc., Hewlett-Packard Co., Bycast Inc., IBM and Avamar Technologies Inc. Most CAS vendors possess a remarkably similar view of CAS, though each vendor puts its own unique stamp on the technology.

For most products, the main emphasis is on data deduplication, where redundant pieces of information are eliminated to reduce the total archival storage requirements. EMC's Avamar product is particularly notable for this feature, breaking files into small blocks that Avamar called "atomics." When changes are made to a file, or a new file is archived, only the new or changed blocks are actually stored to disk. Reducing the total storage demands also speeds conventional backups or restorations because there is less data to transfer.
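Block-level deduplication of this kind can be illustrated as follows. The sketch assumes fixed-size blocks identified by SHA-256 digests; it is not a description of how Avamar actually divides files, and `archive_blocks`, `restore` and the tiny `BLOCK_SIZE` are hypothetical choices made for readability.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


def archive_blocks(data, block_store):
    """Store data block by block; return its recipe and the count of new blocks."""
    recipe, new_blocks = [], 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:  # only unseen blocks are written
            block_store[digest] = block
            new_blocks += 1
        recipe.append(digest)
    return recipe, new_blocks


def restore(recipe, block_store):
    """Rebuild a file from its ordered list of block digests."""
    return b"".join(block_store[digest] for digest in recipe)


block_store = {}
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # only the third block differs from v1

recipe1, new1 = archive_blocks(v1, block_store)
recipe2, new2 = archive_blocks(v2, block_store)
print(new1, new2)  # prints "4 1": the revised file adds just one block
```

Both versions remain fully restorable from their recipes, yet the second version costs one block of storage rather than four.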

Once data is passed to CAS, it cannot change and must be protected against theft, so other CAS products emphasize the immutability and security of archival data. Nexsan's Assureon product incorporates AES 256-bit encryption to protect files relegated to the archive. Assureon also adds serialization to track the existence of each CAS location and prevent file tampering. Serialized locations can be scanned periodically to verify the integrity of each file, and any files that are damaged or incomplete can be dealt with promptly.
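A periodic integrity scan over serialized locations might look something like the sketch below. It assumes each serial number is catalogued with a SHA-256 digest taken at archive time; this is an illustration of the general idea, not Assureon's actual mechanism, and the `scan` function and sample data are hypothetical.

```python
import hashlib


def scan(catalog, storage):
    """Compare each catalogued serial number against what is currently stored.

    catalog: serial number -> digest recorded at archive time
    storage: serial number -> bytes currently stored
    Returns (missing serials, damaged serials).
    """
    missing, damaged = [], []
    for serial, expected in sorted(catalog.items()):
        content = storage.get(serial)
        if content is None:
            missing.append(serial)  # the serialized location no longer exists
        elif hashlib.sha256(content).hexdigest() != expected:
            damaged.append(serial)  # content no longer matches its record
    return missing, damaged


files = {1: b"contract.pdf bytes", 2: b"invoice.pdf bytes", 3: b"memo.txt bytes"}
catalog = {serial: hashlib.sha256(content).hexdigest() for serial, content in files.items()}

storage = dict(files)
storage[2] = b"invoice.pdf bytes (altered)"  # simulate tampering or corruption
del storage[3]                               # simulate a lost file

missing, damaged = scan(catalog, storage)
print(missing, damaged)  # prints "[3] [2]"
```

Because the serial numbers form a known sequence, a gap reveals a deleted file, while a digest mismatch reveals one that was altered or corrupted in place.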

Still other CAS platforms embrace search and scalability features. Search capability relies on sophisticated metadata to help users locate relevant file content long after the original file creator may have forgotten about it. Scalability is important to accommodate archival growth and manage huge numbers of CAS objects (into the billions) over the long term. EMC and Sun products both favor these areas.

Applications of CAS

CAS products are deployed in a wide variety of roles that extend well beyond archival storage to embrace backup/restoration, improve storage performance, meet regulatory requirements and save significant costs.

The data deduplication features of CAS platforms are sometimes used to reduce storage requirements. If corporate data can be concentrated into some fraction of its original space, backups and restorations (e.g., to tape or optical media) can be accomplished faster, simply because there is less data to transfer. Lower data volumes can also speed backup and replication tasks across WAN links to remote sites. Lost or damaged files can be restored directly from disk without the time or trouble of locating those files on other media.

CAS is sometimes selected over other long-term storage options to provide a superior user experience. For example, check images or X-ray data stored to tape or disc must often be retrieved manually once the corresponding media is located and loaded. It can take hours (even longer) for end users to obtain data stored on traditional media. A disk-based CAS system can keep that data nearline and supply files on demand without any manual intervention.
