Archives retain business data that is accessed only infrequently, yet must be kept for a prolonged period of time. For example, the medical records of adult patients generally need to be kept for at least seven years, but doctors' notes, X-rays and MRI images may be referenced only during office visits a few times each year. Data retrieval poses special challenges for archival storage systems, which must provide the capacity and reliability to retain data for many years, protect that data against unauthorized changes and quickly locate fragments of data from the archive on demand. This overview highlights the key issues involved with retrieving data from archives.
Archives are often implemented to meet local or national compliance regulations governing the availability and retention of data, most often in the financial and healthcare industries. Some of the best-known compliance regulations include the Sarbanes-Oxley Act (SOX) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA), but there are about 15,000 other regulations that businesses need to be aware of as well. See the article Storage Compliance Explained for more details.
Immutability is often a main attribute of archival storage systems. That is, once data is committed to an archive, it cannot be changed or deleted until its retention period has expired. This is often referred to as a write once, read many (WORM) archive, and it is commonly implemented with content-addressed storage (CAS). Files are typically assigned a unique identifier that is stored along with the data when it is written to the archive; in CAS, that identifier is derived from the content itself. In many cases, any data retrieved to support litigation must come from an immutable archive -- otherwise the authenticity of the data is difficult to establish. Some archives can export data to tape or virtual tape libraries (VTLs) so that the archive can be backed up periodically.
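The write-once behavior described above can be illustrated with a small sketch. It is not how any particular archive product works: the WormArchive class, its in-memory dictionary and the choice of SHA-256 as the identifier are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timedelta, timezone


class WormArchive:
    """Toy in-memory write-once archive (hypothetical, for illustration only)."""

    def __init__(self):
        # content_id -> (data, retain_until)
        self._objects = {}

    def write(self, data: bytes, retention_days: int, now=None) -> str:
        now = now or datetime.now(timezone.utc)
        # Assumption: the unique identifier is the SHA-256 digest of the content.
        content_id = hashlib.sha256(data).hexdigest()
        # An existing object is never overwritten; rewriting identical
        # content simply returns the identifier it already has.
        if content_id not in self._objects:
            retain_until = now + timedelta(days=retention_days)
            self._objects[content_id] = (data, retain_until)
        return content_id

    def read(self, content_id: str) -> bytes:
        data, _ = self._objects[content_id]
        # Recomputing the digest on retrieval shows the data is unchanged.
        if hashlib.sha256(data).hexdigest() != content_id:
            raise ValueError("stored data does not match its identifier")
        return data

    def delete(self, content_id: str, now=None) -> None:
        now = now or datetime.now(timezone.utc)
        _, retain_until = self._objects[content_id]
        if now < retain_until:
            raise PermissionError("retention period has not expired")
        del self._objects[content_id]
```

Because the identifier depends only on the content, any later change to the stored bytes would be detected when the object is read back.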
Extending storage capacity
Archival storage capacity is always a concern since data is, as mentioned above, generally immutable and cannot be deleted until the retention period expires. This requires careful capacity management to ensure that the archive does not run out of space. One of the major technologies used to extend capacity is data deduplication, also called intelligent compression or single-instance storage.
Data deduplication works by eliminating redundant data from the archive -- saving only one unique instance of the file, block or byte sequence to the archive and replacing each subsequent instance with a small pointer to the saved copy. In normal operation, a deduplicated archive can achieve effective reduction ratios from 10:1 up to 50:1. Today, most archives employ block- or byte-level data deduplication to reduce storage demands.
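As a rough sketch of the block-level approach, the example below splits each file into fixed-size blocks, stores each unique block once and keeps a list of pointers per file. The DedupStore class, the 4 KB block size and the use of SHA-256 digests as pointers are assumptions for illustration; commercial products often use variable-size blocks and other refinements.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


class DedupStore:
    """Toy block-level deduplicating store (hypothetical)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}        # digest -> block bytes (each unique block once)
        self.files = {}         # file name -> ordered list of block digests
        self.logical_bytes = 0  # total bytes written by callers

    def put(self, name: str, data: bytes) -> None:
        pointers = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if it has not been seen before.
            self.blocks.setdefault(digest, block)
            pointers.append(digest)
        self.files[name] = pointers
        self.logical_bytes += len(data)

    def get(self, name: str) -> bytes:
        # Rebuild the file by following its pointers.
        return b"".join(self.blocks[d] for d in self.files[name])

    def ratio(self) -> float:
        # Logical bytes written divided by unique bytes actually stored.
        stored = sum(len(b) for b in self.blocks.values())
        return self.logical_bytes / stored if stored else 0.0
```

A reduction ratio of r corresponds to a space saving of 1 - 1/r, so 10:1 means about 90% less space is consumed and 50:1 means about 98% less.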
Index and search
An archive can eventually contain hundreds of gigabytes or more spread out across hundreds of millions of unique files. Without a way to locate specific files, retrieving important data months or years later would be problematic at best, so powerful indexing and searching capabilities are an essential element of many archive platforms.
Indexing generates metadata about each file, and possibly about the contents of the file, and then organizes those details into a database or repository with indices that can be searched efficiently at a later date. Metadata may include details like a filename, description, creator, creation date, key search words and many other items that are often customized to meet the unique needs of each company. The index may be stored on the archive along with the data.
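A minimal sketch of such a metadata index, using SQLite from the Python standard library, is shown below. The table layout, column names and helper functions (create_index, index_file, find_by_keyword) are assumptions for illustration; real archive platforms define their own schemas.

```python
import sqlite3


def create_index(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row of metadata per archived file,
    # plus a keyword table that supports fast lookups.
    conn.executescript("""
        CREATE TABLE file_index (
            content_id  TEXT PRIMARY KEY,
            filename    TEXT NOT NULL,
            creator     TEXT,
            created     TEXT,
            description TEXT
        );
        CREATE TABLE keywords (
            content_id TEXT REFERENCES file_index(content_id),
            keyword    TEXT
        );
        CREATE INDEX idx_keywords_keyword ON keywords(keyword);
        CREATE INDEX idx_file_created ON file_index(created);
    """)


def index_file(conn, content_id, filename, creator, created,
               description, keywords):
    # 'created' is an ISO 8601 date string, so text comparison sorts by date.
    with conn:  # commit both inserts together
        conn.execute(
            "INSERT INTO file_index VALUES (?, ?, ?, ?, ?)",
            (content_id, filename, creator, created, description),
        )
        conn.executemany(
            "INSERT INTO keywords VALUES (?, ?)",
            [(content_id, k.lower()) for k in keywords],
        )


def find_by_keyword(conn, keyword,
                    created_from="0000-01-01", created_to="9999-12-31"):
    rows = conn.execute(
        """SELECT f.filename
           FROM file_index AS f
           JOIN keywords AS k ON k.content_id = f.content_id
           WHERE k.keyword = ? AND f.created BETWEEN ? AND ?
           ORDER BY f.created, f.filename""",
        (keyword.lower(), created_from, created_to),
    )
    return [row[0] for row in rows]
```

Keeping the keywords in their own indexed table means a lookup does not have to scan every metadata row, which matters once the archive holds millions of files.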
Search tools are used to locate the data for retrieval. Depending on the search tool, searches can use the metadata indexes or even "look inside" some files, such as documents or PDF files, to perform deeper contextual searches of file content. For example, a healthcare provider might search for records based on patient name, provider ID and dates of service. Similarly, broader searches might be performed for all patients sharing the same illness/diagnosis or prescribed drugs. In many cases, search results are displayed by relevance in a Web browser-based display similar to Google.
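To make the idea of content search with relevance ranking concrete, here is a small sketch of an inverted index that scores documents by how often the query terms appear in them. The ContentIndex class and its simple term-count scoring are assumptions for illustration; production search engines use considerably more sophisticated ranking.

```python
import re
from collections import Counter, defaultdict


class ContentIndex:
    """Toy full-text index with term-frequency ranking (hypothetical)."""

    def __init__(self):
        # term -> Counter mapping doc_id to occurrences of that term
        self._postings = defaultdict(Counter)

    @staticmethod
    def _tokens(text: str):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id: str, text: str) -> None:
        for term in self._tokens(text):
            self._postings[term][doc_id] += 1

    def search(self, query: str):
        scores = Counter()
        for term in self._tokens(query):
            # .get avoids creating empty entries for unknown terms.
            for doc_id, count in self._postings.get(term, {}).items():
                scores[doc_id] += count
        # Highest score first; ties are broken by doc_id for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked]
```

In this sketch the text of each document is assumed to have been extracted already; pulling text out of formats such as PDF is a separate step handled by the search tool.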
Do not underestimate the business importance of indexing and searching. Retrieving needed files is crucial for compliance audits, e-discovery and litigation support activities. When a demand for discovery is made, a company typically has only weeks to locate and provide the required data. Failure to tender data in a timely fashion can have terrible financial consequences for a business.
Security and retention
Data retrieval from an archive should also be restricted to authorized personnel -- especially if the archive is not immutable. Credentials should be required to authenticate each user, and a detailed activity log should capture file access and track other user activities within the archive. Solid security precautions will reduce the chance that files are altered or deleted unexpectedly.
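The combination of an authorization check and an activity log might look like the following sketch. The AuditedArchive class and its fields are hypothetical, and the sketch assumes that the user name passed in has already been authenticated by some other mechanism.

```python
from datetime import datetime, timezone


class AuditedArchive:
    """Toy archive front end with access checks and an audit log (hypothetical)."""

    def __init__(self, objects, authorized_users):
        self._objects = dict(objects)          # object_id -> data
        self._authorized = set(authorized_users)
        self.audit_log = []                    # append-only list of events

    def _record(self, user, action, object_id, allowed):
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "object": object_id,
            "allowed": allowed,
        })

    def retrieve(self, user: str, object_id: str) -> bytes:
        # Assumption: 'user' is an identity that was authenticated elsewhere.
        allowed = user in self._authorized
        # Log every attempt, including the ones that are refused.
        self._record(user, "retrieve", object_id, allowed)
        if not allowed:
            raise PermissionError(f"{user} may not retrieve {object_id}")
        return self._objects[object_id]
```

Recording denied attempts as well as successful ones gives auditors a complete picture of who tried to reach which files.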
Archives should also be implemented with well-defined data retention and deletion policies in place. Archived data must often be available for retrieval over years -- even decades -- so retention is important to meet compliance and legal obligations. Retention periods can vary by file type and may be set in metadata during the file archiving process. Once set, a retention period generally cannot be changed before the file is deleted.
Deletion is often an overlooked aspect of retention. Experts suggest that there is greater legal exposure in retaining unnecessary data (past its retention period) than in deleting it, so data should be securely destroyed as soon as its retention period expires. This also frees up valuable space on the archive. The archive platform itself will generally provide the software needed to secure the system and to set retention and deletion policies.
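The retention and deletion policies discussed in this section can be sketched as a lookup of retention periods by file type plus a sweep that lists records whose retention has expired. The RETENTION_YEARS table, the record layout and the function names are assumptions for illustration; apart from the seven-year medical record example used earlier, the periods shown are made up.

```python
from datetime import date

# Hypothetical retention periods in years, keyed by file type.
RETENTION_YEARS = {"medical_record": 7, "invoice": 5}
DEFAULT_RETENTION_YEARS = 7  # assumed fallback for unlisted file types


def retain_until(archived_on: date, file_type: str) -> date:
    years = RETENTION_YEARS.get(file_type, DEFAULT_RETENTION_YEARS)
    try:
        return archived_on.replace(year=archived_on.year + years)
    except ValueError:
        # February 29 has no counterpart in a non-leap year; use March 1.
        return archived_on.replace(year=archived_on.year + years,
                                   month=3, day=1)


def expired(records, today: date):
    """Return the ids of records that are eligible for secure deletion."""
    return [
        r["id"]
        for r in records
        if retain_until(r["archived_on"], r["file_type"]) <= today
    ]
```

Running a sweep like this on a schedule is one way to make sure expired data is identified promptly rather than lingering in the archive.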
This was first published in May 2007