Archives retain business data that is only accessed
infrequently, yet must be kept for a prolonged period of time. For
example, the medical records of adult patients generally need to be
kept for at least seven years, but doctor's notes, X-rays and MRI
images may only be referenced during office visits several times
each year. Data retrieval poses special challenges for archival
storage systems, which must provide the capacity and reliability to
retain data for many years, protect that data against unauthorized
changes and quickly locate fragments of data from the archive on
demand. This overview highlights the key issues involved with
retrieving data from archives.
Immutability
Archives are often implemented to meet local or national
compliance regulations governing the availability and retention of
data, most often in the financial and healthcare industries. Some of
the best-known compliance regulations include the Sarbanes-Oxley Act
(SOX) and the Health Insurance Portability and Accountability Act of
1996 (HIPAA), but there are about 15,000 other regulations that
businesses also need to be aware of. See the article
Storage Compliance Explained for more
details.
Immutability is often a main attribute of archival storage
systems. That is, once data is committed to an archive, it cannot
be changed or deleted until its retention period has expired. Such a
system is often referred to as a write once, read many (WORM) archive
and is commonly built on content-addressed storage (CAS). Files are
typically assigned a unique identifier, in CAS usually derived from a
hash of the content, that is stored along with the data when it is
written to the archive. In
many cases, any data retrieved to support litigation must be from
an immutable archive -- otherwise there is no way to determine
authenticity of the data. Some archives can export to tape or virtual
tape libraries (VTL) so that the archived data can be backed up
periodically.
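The immutability rules described above can be illustrated with a small sketch. It is not how any particular archive product works; the class name `WormArchive`, its methods, and the choice of SHA-256 as the content identifier are all assumptions made for illustration, and the store is held in memory only.

```python
import hashlib
from datetime import datetime, timedelta, timezone


class WormArchive:
    """Hypothetical in-memory stand-in for an immutable, content-addressed archive."""

    def __init__(self):
        self._objects = {}  # content ID -> (data, retention expiry)

    def put(self, data: bytes, retention: timedelta, now: datetime) -> str:
        # The identifier is derived from the content itself (SHA-256 is an assumption).
        content_id = hashlib.sha256(data).hexdigest()
        # setdefault never overwrites: an existing object and its expiry stay untouched.
        self._objects.setdefault(content_id, (data, now + retention))
        return content_id

    def get(self, content_id: str) -> bytes:
        data, _ = self._objects[content_id]
        # Re-hash on read: a mismatch means the stored bytes were altered.
        if hashlib.sha256(data).hexdigest() != content_id:
            raise ValueError("stored content does not match its identifier")
        return data

    def delete(self, content_id: str, now: datetime) -> None:
        _, expires = self._objects[content_id]
        if now < expires:
            raise PermissionError("retention period has not expired")
        del self._objects[content_id]
```

Because the identifier is computed from the data, any later change to the stored bytes shows up as a mismatch on retrieval, which is one way an archive can support claims of authenticity.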
Extending storage capacity
Archival storage capacity is always a concern since data is, as
mentioned above, generally immutable and cannot be deleted until
the retention period expires. This requires careful capacity
management to ensure that the archive does not run out of space.
One of the major technologies used to extend capacity is data
deduplication, also called intelligent compression or
single-instance storage.
Data deduplication works by eliminating redundant data from the
archive -- saving only one unique instance of the file, block or
byte to the archive and replacing subsequent instances with a
small pointer to the saved copy. In normal operation, a
deduplicated archive can achieve effective reduction ratios from
10:1 up to 50:1. Today, most archives employ block- or byte-level
data deduplication to reduce storage demands.
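As a rough illustration of the block-level approach, the sketch below stores each unique block once and records every file as a list of pointers (block hashes). The class `DedupStore`, the fixed 4 KB block size and the use of SHA-256 are assumptions for the example, not a description of any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products may use variable blocks


class DedupStore:
    """Hypothetical block-level deduplicating store, kept in memory."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes (each unique block stored once)
        self.files = {}   # file name -> ordered list of block hashes (the pointers)

    def write(self, name: str, data: bytes) -> None:
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep only the first copy
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name: str) -> bytes:
        # Follow the pointers to rebuild the original file.
        return b"".join(self.blocks[d] for d in self.files[name])

    def ratio(self) -> float:
        # Logical bytes written divided by bytes physically stored
        # (pointer overhead is ignored in this sketch).
        logical = sum(len(self.blocks[d]) for p in self.files.values() for d in p)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical if physical else 1.0
```

Writing the same two-block file under two names leaves only two blocks in `blocks`, so `ratio()` reports 2.0; the much higher ratios quoted above depend on how repetitive the archived data actually is.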
Index and search
An archive can eventually contain hundreds of gigabytes or more
spread out across hundreds of millions of unique files. Without an
index, retrieving important data months or years later would be
problematic at best, so powerful indexing and searching capabilities
are an essential element of many archive platforms.
Indexing generates metadata about each file, and possibly about the
contents of the file, and then organizes that metadata into a
database or repository with indices that can be searched
efficiently at a later date. Metadata may include details
like a filename, description, creator, creation date, key search
words, and many other items that are often customized to meet the
unique needs of each company. The index may be stored on the
archive along with the data. Search tools then use the index to
locate data for retrieval. Depending on the search tool, searches
can use the metadata indexes or even "look inside" some files, such
as documents or PDF files, to perform deeper
contextual searches of file content. For example, a healthcare
provider might search for records based on patient name, provider
ID and dates of service. Similarly, broader searches might be
performed for all patients sharing the same illness/diagnosis or
prescribed drugs. In many cases, search results are ranked by
relevance and shown in a Web browser, much like the results of a
search engine such as Google.
Do not underestimate the business importance of indexing and searching.
Retrieving needed files is crucial for compliance audits,
e-discovery and litigation support activities. When a demand for
discovery is made, a company typically has only weeks to locate and
provide the required data. Failure to tender data in a timely
fashion can have terrible financial consequences for a business.
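The healthcare searches mentioned above can be sketched with a small metadata index. This example uses Python's built-in `sqlite3` module with an in-memory database; the table layout, column names and helper functions are hypothetical and stand in for whatever schema a real archive platform defines.

```python
import sqlite3


def build_index() -> sqlite3.Connection:
    """Create an empty, in-memory metadata index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # service_date is stored as ISO 8601 text so string order matches date order.
    conn.execute(
        "CREATE TABLE records ("
        " archive_id TEXT PRIMARY KEY,"
        " patient_name TEXT,"
        " provider_id TEXT,"
        " service_date TEXT,"
        " diagnosis TEXT)"
    )
    # Indices on the columns that searches are expected to filter on.
    conn.execute("CREATE INDEX idx_patient ON records (patient_name, service_date)")
    conn.execute("CREATE INDEX idx_diagnosis ON records (diagnosis)")
    return conn


def add_record(conn, archive_id, patient_name, provider_id, service_date, diagnosis):
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
        (archive_id, patient_name, provider_id, service_date, diagnosis),
    )


def find_by_patient(conn, patient_name, provider_id, start, end):
    """Narrow search: one patient, one provider, a range of service dates."""
    rows = conn.execute(
        "SELECT archive_id FROM records"
        " WHERE patient_name = ? AND provider_id = ?"
        " AND service_date BETWEEN ? AND ?"
        " ORDER BY service_date",
        (patient_name, provider_id, start, end),
    )
    return [row[0] for row in rows]


def find_by_diagnosis(conn, diagnosis):
    """Broader search: every record that shares a diagnosis."""
    rows = conn.execute(
        "SELECT archive_id FROM records WHERE diagnosis = ? ORDER BY archive_id",
        (diagnosis,),
    )
    return [row[0] for row in rows]
```

The returned `archive_id` values are what a search tool would then hand to the archive to retrieve the files themselves. Searching inside file content would need a full-text index, which this sketch leaves out.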
Security and retention
Data retrieval from an archive should also be restricted to
authorized personnel -- especially if the archive is not immutable.
Credentials should be required to authenticate each user, and a
detailed activity log should capture file access and track other
user activities within the archive. Solid security precautions will
reduce the chance that files are altered or deleted
unexpectedly.
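A minimal sketch of these two precautions, credential checks and an activity log, might look like the following. The `ArchiveGateway` class, its method names and the PBKDF2 settings are assumptions for illustration; a production system would normally rely on the archive platform's own security features or a directory service.

```python
import hashlib
import hmac
import os
from datetime import datetime, timezone


class ArchiveGateway:
    """Hypothetical front end that authenticates users and logs every retrieval."""

    def __init__(self):
        self._users = {}     # user name -> (salt, password hash)
        self.audit_log = []  # entries: (timestamp, user, action, file ID, allowed)

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        # Salted PBKDF2 hash; the iteration count is an assumed value.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def add_user(self, user: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[user] = (salt, self._hash(password, salt))

    def _authenticate(self, user: str, password: str) -> bool:
        entry = self._users.get(user)
        if entry is None:
            return False
        salt, expected = entry
        # Constant-time comparison avoids leaking information through timing.
        return hmac.compare_digest(self._hash(password, salt), expected)

    def retrieve(self, user: str, password: str, file_id: str, archive: dict) -> bytes:
        allowed = self._authenticate(user, password)
        # Log the attempt whether or not it succeeds.
        self.audit_log.append(
            (datetime.now(timezone.utc), user, "retrieve", file_id, allowed)
        )
        if not allowed:
            raise PermissionError("authentication failed")
        return archive[file_id]
```

Recording failed attempts as well as successful ones gives auditors a fuller picture of who tried to reach which files.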
Archives should also be implemented with well-defined data
retention and deletion policies in place. Archived data must often
be available for retrieval over years -- even decades -- so
retention is important to meet compliance and legal obligations.
Retention periods can vary by file type. They may be set in metadata
during the file archiving process and generally cannot be changed
afterward. Deletion is often an overlooked aspect of
retention. Experts suggest that there is greater legal exposure in
retaining unnecessary data (past its retention period) than in
deleting it, so data should be securely destroyed as soon as its
retention period expires. This also frees up valuable space on the
archive. The archive platform itself will generally provide the
software needed to secure the system and to set retention and
deletion policies.
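To make the idea of a per-file-type retention policy concrete, here is a small sketch. The file types, the retention periods in `RETENTION_YEARS` and the catalog layout are hypothetical; actual periods come from the regulations that apply to each business.

```python
from datetime import date

# Hypothetical retention periods in years, keyed by file type.
RETENTION_YEARS = {"medical_record": 7, "invoice": 5}
DEFAULT_YEARS = 7  # assumed fallback for file types not listed above


def expiry_date(archived_on: date, file_type: str) -> date:
    """Return the date on which a file's retention period ends."""
    years = RETENTION_YEARS.get(file_type, DEFAULT_YEARS)
    try:
        return archived_on.replace(year=archived_on.year + years)
    except ValueError:
        # 29 February archived in a leap year, expiring in a non-leap year.
        return archived_on.replace(year=archived_on.year + years, day=28)


def due_for_deletion(catalog: dict, today: date) -> list:
    """Return the IDs of files whose retention period has expired.

    catalog maps a file ID to a (file type, archived-on date) pair.
    """
    return [
        file_id
        for file_id, (file_type, archived_on) in catalog.items()
        if expiry_date(archived_on, file_type) <= today
    ]
```

A scheduled job could run `due_for_deletion` against the archive's catalog and pass the resulting IDs to whatever secure-destruction routine the platform provides.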