Under normal circumstances, a
backup is simply a copy of data that is kept
aside to protect against data loss -- when a file is lost due to
user error, or data is corrupted because of system problems, the
affected data can be restored from a backup. An
archive is different from a backup because
the data may not be used for months, even years, but must be
accessed quickly when needed. This is further complicated by
archives that are growing by as much as 90% a year, or even
more. There is simply no time to search through burgeoning
volumes of
tape or
optical media to locate important files.
Traditional backup platforms are poorly suited to archival data
storage, so users are turning to
disk storage systems for a mix of
performance and reliability. Files can be archived to any disk
storage system, but
content-addressed storage (CAS) technology
has emerged specifically to support archiving efforts
(see the SearchStorage.com Tech Closeup on CAS).
Understanding CAS
At the simplest level, CAS is a specialized disk storage system.
Since archival data is not accessed frequently, high-performance
disks are not essential. In fact, most CAS platforms employ
ordinary
SATA hard disks for their low cost per
gigabyte, though SAS disks may be used when added performance is
needed to accommodate many simultaneous users. However, CAS
technology incorporates a unique feature set designed to
optimize storage space and improve long-term data
management.
CAS technology extends the use of
metadata to define a file. While any file
may include mundane date, time, name or creator metadata, CAS
allows a tremendous amount of additional information to be
stored along with the file. Extended metadata can be essential
for indexing and searching old data well into the future. For
example, a physician could use metadata to search through
patient files and retrieve X-rays from patients with a specific
physical condition. Metadata and index/search features are also
critical for meeting e-discovery or other litigation requests.
Encryption techniques are sometimes employed
to secure sensitive or confidential data.
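
As a rough illustration of the metadata-driven search described above, the sketch below keeps a small in-memory index that maps metadata field/value pairs to object identifiers. The `MetadataIndex` class, the field names (`modality`, `condition`) and the object IDs are hypothetical and are not taken from any CAS product; a real system would persist the index and support far richer queries.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index: (field, value) -> set of object IDs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, object_id, metadata):
        # Register every metadata field/value pair for this object.
        for field, value in metadata.items():
            self._index[(field, value)].add(object_id)

    def search(self, **criteria):
        # Return the IDs of objects that match ALL given field/value pairs.
        if not criteria:
            return set()
        matches = [self._index.get(pair, set()) for pair in criteria.items()]
        return set.intersection(*matches)


index = MetadataIndex()
index.add("obj-001", {"modality": "x-ray", "condition": "fracture"})
index.add("obj-002", {"modality": "x-ray", "condition": "pneumonia"})
index.add("obj-003", {"modality": "mri", "condition": "fracture"})

# The physician's query: X-rays from patients with a specific condition.
fracture_xrays = index.search(modality="x-ray", condition="fracture")
print(fracture_xrays)  # {'obj-001'}
```

Because the index is separate from the archived objects, a search like this never has to open the files themselves, which is what keeps it practical years after the data was written.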
Next, CAS data cannot be changed once it is archived. This
ensures data integrity and prevents tampering or spoliation. A
corporate regulatory audit or litigation discovery can proceed with
high confidence that the data being examined is original and
unaltered. Tamper-proofing is generally accomplished by treating
files as objects with unique designators and locations. Since most
archival data has a finite lifecycle, CAS also manages data
retention and disposal in accordance with regulatory or compliance
requirements. Data reaching its retention limit is systematically
deleted.
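
The following minimal sketch shows one common way to get the behavior described above: the "unique designator" is a SHA-256 digest of the content, objects are never overwritten, and each object carries a retention period. The `WriteOnceStore` class, its method names and the retention values are assumptions made for illustration; actual CAS products differ in how they derive addresses and enforce retention.

```python
import hashlib
from datetime import date, timedelta


class WriteOnceStore:
    """Toy content-addressed store: the SHA-256 digest is the address."""

    def __init__(self):
        self._objects = {}  # address -> (content bytes, expiry date)

    def put(self, content, stored_on, retention_days):
        address = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same address; an existing object
        # is never overwritten, so stored data cannot be altered in place.
        if address not in self._objects:
            expires = stored_on + timedelta(days=retention_days)
            self._objects[address] = (content, expires)
        return address

    def get(self, address):
        return self._objects[address][0]

    def purge_expired(self, today):
        # Delete objects whose retention period has ended.
        expired = [a for a, (_, exp) in self._objects.items() if exp <= today]
        for address in expired:
            del self._objects[address]
        return expired


store = WriteOnceStore()
original = store.put(b"audit report 2007", date(2007, 1, 1), retention_days=365)
edited = store.put(b"audit report 2007 (edited)", date(2007, 1, 2), retention_days=365)

# Changed content gets a different address; the original is untouched.
print(original != edited)                           # True
print(store.get(original))                          # b'audit report 2007'
print(store.purge_expired(date(2008, 1, 1)) == [original])  # True
```

Since the address is computed from the content, any alteration produces a new object rather than modifying the old one, which is why an auditor can treat a retrieved object as unaltered.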
One persistent problem with traditional file copies is the
inevitable duplication of files. If there are 100 different copies
of an e-mail file attachment, all 100 copies are saved in the
backup. For long-term archival storage, this kind of inefficiency
can quickly exhaust available storage space. Another real strength
of CAS technology is in data deduplication (a.k.a. single-instance
storage or intelligent compression), which eliminates duplicated
blocks of data. Only one instance of the data is saved, and subsequent
copies of the file are simply referenced back to that one saved
copy. Consider a file-level example. If there are 100 attachments
and each is 2 MB in size, archiving to CAS would take only about 2 MB
(plus a small reference for each copy) to save all 100 attachments,
instead of 200 MB with an ordinary disk system. Experts note that
data deduplication can reduce storage demands by as much as 50-to-1. Conventional
compression techniques may also be employed
to reduce disk space even further.
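
The file-level example above can be reproduced with a short sketch. It assumes a hash-based single-instance store, where each distinct file is kept once and every archived copy is recorded as a reference to it; the `SingleInstanceStore` class and the file names are hypothetical.

```python
import hashlib

MB = 1024 * 1024


class SingleInstanceStore:
    """Toy file-level deduplication: one stored copy per distinct content."""

    def __init__(self):
        self._blobs = {}       # digest -> content, stored once
        self._references = []  # (file name, digest), one per archived file

    def archive(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)  # store only if new
        self._references.append((name, digest))

    def stored_bytes(self):
        # Space actually consumed by file content (references excluded).
        return sum(len(blob) for blob in self._blobs.values())

    def logical_bytes(self):
        # Space the same files would consume without deduplication.
        return sum(len(self._blobs[d]) for _, d in self._references)


attachment = bytes(2 * MB)  # stand-in for a 2 MB e-mail attachment
store = SingleInstanceStore()
for i in range(100):
    store.archive(f"mailbox-{i}/attachment.bin", attachment)

print(store.logical_bytes() // MB)  # 200
print(store.stored_bytes() // MB)   # 2
```

The 100 references themselves occupy a little space as well, so the real saving is slightly under 100-to-1 in this case.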
Power consumption is an important consideration. As CAS systems
scale up to hundreds of spinning disks, the power cost becomes
substantial. Some archive systems employ creative solutions
to reduce power demands, such as placing drives in a low-power idle
state or powering idle drives down completely. Low-power drives and emerging drive
technologies like "hybrid drives" can also help lower overall power
demands.
CAS products
Major vendors in the CAS market include -- in no particular
order -- EMC Corp., Nexsan Technologies, Sun Microsystems Inc.,
StorageTek, Permabit Inc., Hewlett-Packard Co., Bycast Inc., IBM
and Avamar Technologies Inc. Most CAS vendors possess a remarkably
similar view of CAS, though each vendor puts its own unique stamp
on the technology.
For most products, the main emphasis is on data deduplication where
redundant pieces of information are eliminated to reduce the total
archival storage requirements. EMC's Avamar product is particularly
notable for this feature, breaking files into small blocks that
Avamar called "atomics." When changes are made to a file, or a new
file is archived, only the new/changed blocks are actually stored
to disk. Reducing the total storage demands also speeds
conventional backups or restorations because there is less data
volume to transfer.
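
To make the block-level idea concrete, here is a minimal sketch that splits data into fixed-size blocks and stores only blocks it has not seen before. Fixed-size blocking, the tiny `BLOCK_SIZE` and the `BlockStore` class are simplifying assumptions for illustration; this is not a description of how Avamar or any other product actually segments data.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


class BlockStore:
    """Toy block-level deduplication keyed by SHA-256 digest."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def archive(self, data):
        """Store data; return (recipe, number of newly stored blocks)."""
        recipe, new_blocks = [], 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)  # ordered list of blocks that form the file
        return recipe, new_blocks

    def restore(self, recipe):
        return b"".join(self.blocks[d] for d in recipe)


store = BlockStore()
version1 = b"AAAABBBBCCCCDDDD"
version2 = b"AAAABBBBXXXXDDDD"  # only the third block differs

recipe1, new1 = store.archive(version1)
recipe2, new2 = store.archive(version2)

print(new1, new2)                          # 4 1
print(store.restore(recipe2) == version2)  # True
```

Archiving the second version writes a single new block, which is why a changed file adds little to the archive and moves little data during backup or restoration.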
Once data is passed to CAS, it cannot change and must be
protected against theft, so other CAS products emphasize the
immutability and security of archival data. Nexsan's Assureon
product incorporates 256-bit AES encryption to protect files
relegated to the archive. Assureon also adds serialization to
track the existence of each CAS location and prevent file
tampering. Serialized locations can be scanned periodically to
verify the integrity of each file, and any files that are damaged
or incomplete can be dealt with promptly.
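
A periodic integrity scan of this kind might look like the sketch below. It assumes each object receives a sequential serial number and a SHA-256 digest at ingest, so that a gap in the sequence reveals a missing object and a digest mismatch reveals altered content. The `scan` and `ingest` functions and the data layout are hypothetical and do not describe Assureon's internal design.

```python
import hashlib


def ingest(content):
    """Return the (content, digest) pair recorded when an object is archived."""
    return content, hashlib.sha256(content).hexdigest()


def scan(archive, highest_serial):
    """Return (missing serials, corrupted serials).

    `archive` maps serial number -> (content bytes, digest recorded at ingest).
    """
    missing, corrupted = [], []
    for serial in range(1, highest_serial + 1):
        entry = archive.get(serial)
        if entry is None:
            missing.append(serial)    # gap in the sequence: object is gone
            continue
        content, recorded = entry
        if hashlib.sha256(content).hexdigest() != recorded:
            corrupted.append(serial)  # content no longer matches its digest
    return missing, corrupted


archive = {1: ingest(b"invoice"), 2: ingest(b"contract"), 3: ingest(b"x-ray")}
archive[2] = (b"contract (altered)", archive[2][1])  # simulate tampering
del archive[3]                                        # simulate a lost object

missing, corrupted = scan(archive, highest_serial=3)
print(missing, corrupted)  # [3] [2]
```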
Still other CAS platforms embrace search and scalability
features. Search capability relies on sophisticated metadata to
help users to locate relevant file content long after the original
file creator may have forgotten about it. Scalability is important
to accommodate archival growth and manage huge numbers of CAS objects
(into the billions) over the long term. EMC and Sun products both
emphasize these areas.
Applications of CAS
CAS products are deployed in a wide variety of roles that extend
well beyond archival storage: supporting backup/restoration, improving
storage performance, meeting regulatory requirements and saving
significant costs.
The data deduplication features of CAS platforms are sometimes
used to reduce storage requirements. If corporate data can be
reduced to some fraction of its original size, backups and
restorations (e.g. to tape or optical media) can be accomplished
faster, simply because there is less data to transfer. Lower data
volumes can also speed backup and replication tasks across WAN
links to remote sites. Lost or damaged files can be restored
directly from disk without the time or trouble of locating those
files on other media.
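
The effect on WAN replication can be estimated with simple arithmetic. The figures in this sketch (1 TB of data, a 100 Mbit/s link and a 10-to-1 deduplication ratio) are assumed purely for illustration, and the calculation ignores protocol overhead and latency.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Ideal transfer time in hours, ignoring protocol overhead and latency."""
    return size_bytes * 8 / link_bits_per_second / 3600


TB = 10**12
LINK = 100 * 10**6  # assumed 100 Mbit/s WAN link
RATIO = 10          # assumed 10-to-1 deduplication ratio

full = transfer_hours(1 * TB, LINK)
reduced = transfer_hours(1 * TB / RATIO, LINK)
print(f"{full:.1f} h without deduplication, {reduced:.1f} h with")
# 22.2 h without deduplication, 2.2 h with
```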
CAS is sometimes selected over other long-term storage options
to provide a superior user experience. For example, check images or
X-ray data stored on tape or optical disc must often be retrieved manually
once the corresponding media is located and loaded. It can take
hours, or even longer, for end users to obtain data stored on
traditional media. A disk-based CAS system can keep that data
nearline and supply files on demand without any manual
intervention.