Data storage components are at the core of any
enterprise storage system. At the very lowest level, hard discs are the
medium that holds vital corporate data.
Whether the data is a mundane memorandum or a mission-critical sales
record, the choice of
hard discs can have a profound impact on the capacity,
performance and long-term reliability of any storage
infrastructure. But it's unwise to trust valuable data to any
single point of failure, so hard discs are combined into groups
that can boost performance and offer redundancy in the event of
disc faults. At an even higher level, those arrays must be
integrated into the storage infrastructure -- combining storage
with network technologies to make data available to users over a
LAN or
WAN. If you're new to storage, or just looking to refresh some
basic concepts, this chapter on data storage components can help to
bring things into focus.
The lowest level: Hard discs
Hard discs are random-access storage mechanisms that store
data on spinning platters (a.k.a. discs) coated with extremely
sensitive magnetic media. Magnetic read/write heads step across the
radius of each platter in set increments, forming concentric
circles of data dubbed "tracks." Hard disc capacity is loosely
defined by the quality of the magnetic media (bits per inch) and
the number of tracks. Thus, a late-model drive with superior media
and finer head control can achieve far more storage capacity than
models just six to 12 months old. Some of today's hard drives can
deliver up to 750
Gbytes of capacity. Capacity is also influenced by specific
drive technologies including perpendicular recording, which fits
more magnetic points into the same physical disc area (a.k.a. areal
density).
The performance of a hard disc is heavily influenced by the
rotational speed (rpm) of the platters and the interface that
connects the drive to its host computer. Speeds from 5,400 to 7,200
rpm are most common in personal computers and secondary storage
systems, while 10,000 and 15,000 rpm discs are allotted to servers
and primary storage systems. The interface itself manages data
transfer to and from the drive. Both
ATA and
SCSI interfaces are traditional parallel architectures that
transfer commands and data across multiple data lines
simultaneously. ATA offered lower data rates and was mainly
employed in personal computers, while SCSI provided faster data
rates and appeared in workstations and servers.
SATA and
SAS are more current interfaces that pass ATA/SCSI commands
serially along a single data wire. The move to serial cabling
allows for faster data transfers and simpler (less expensive)
connections. The interface has no direct impact on the capacity
of a hard disc, however.
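To make the effect of rotational speed concrete, here is a minimal Python sketch that estimates average rotational latency for the platter speeds mentioned above. It assumes the usual rule of thumb that, on average, the platter must turn half a revolution before the wanted sector reaches the head. The function name is illustrative only, and the figures ignore seek time and interface transfer time.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Average wait for a sector to reach the head: half of one revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms in one minute
    return ms_per_revolution / 2.0


# Platter speeds taken from the text above.
for rpm in (5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm: {average_rotational_latency_ms(rpm):.2f} ms")
```

By this estimate, a 15,000 rpm disc waits about 2 ms on average while a 5,400 rpm disc waits more than 5.5 ms, which is one reason the faster discs are reserved for servers and primary storage.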
Fibre Channel (FC) is another popular serial hard disc
interface frequently found in enterprise storage environments. FC
is known for its speed (2 Gbps and, more recently, 4 Gbps) and its
data integrity features. FC
is also a switched interface, so it is possible to create a
"fabric" of storage devices and hosts where every host can see
every storage device -- vastly improving the availability of data.
This is a fundamental technology behind the SAN.
Grouping the discs: RAID
Hard discs are electromechanical devices and their working life
is finite. Media faults, mechanical wear and electronic failures
can all cause problems that render drive contents inaccessible.
This is unacceptable for any organization, so tactics are often
implemented to protect against failure. One of the most common data
protection tactics is arranging groups of discs into arrays. This
is known as a
RAID.
RAID implementations typically offer two benefits: data
redundancy and enhanced performance. Redundancy is achieved by
copying data to two or more discs -- when a fault occurs on one
hard disc, duplicate data on another can be used instead. In many
cases, file contents are also spanned (or
striped) across multiple hard discs. This improves performance
because the various parts of a file can be accessed on multiple
discs simultaneously -- rather than waiting for a complete file to
be accessed from a single disc. RAID can be implemented in a
variety of schemes, each with its own designation:
- RAID-0 -- disc striping is used to improve storage performance,
but there is no redundancy.
- RAID-1 -- disc mirroring offers disc-to-disc redundancy, but
capacity is reduced and performance is only marginally
enhanced.
- RAID-5 --
parity information is spread throughout the disc group,
improving read performance and allowing data for a failed drive to
be reconstructed once the failed drive is replaced.
- RAID-6 -- multiple parity schemes are spread throughout the
disc group, allowing data for up to two simultaneously failed
drives to be reconstructed once the failed drive(s) are
replaced.
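The parity idea behind RAID-5 can be sketched in a few lines of Python. This is an illustration only, not how any particular RAID controller is implemented: it assumes simple XOR parity over a single stripe, and the block contents and the helper name xor_blocks are made up for the example. A real RAID-5 array also rotates the parity block across the discs from stripe to stripe, which this sketch does not show.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks (hypothetical contents) plus one parity block.
data_blocks = [b"SALE", b"MEMO", b"DATA"]
parity = xor_blocks(data_blocks)

# Simulate losing the disc that held the second block, then rebuild its
# contents from the surviving data blocks and the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. RAID-6 extends the idea with a second, independent parity calculation so that two failures can be tolerated.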
There are additional levels, but these four are the most common
and widely used. It is also possible to mix RAID levels in order to
obtain greater benefits. Combinations are typically denoted with
two digits. For example, RAID-50 is a combination of RAID-5 and
RAID-0, sometimes noted as RAID-5+0. As another example, RAID-10 is
actually RAID-1 and RAID-0 implemented together, RAID-1+0. For more
information on RAID controllers, see the SearchStorage.com article
The new breed of RAID controllers.
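As a rough illustration of how a combined level such as RAID-10 places data, the following Python sketch maps a logical block number to the pair of discs that would hold it. The layout is an assumption made for the example (discs numbered from zero, with each mirrored pair occupying two consecutive disc numbers), and the function name is hypothetical. Actual controllers may order and size their stripes differently.

```python
def raid10_locations(logical_block: int, mirror_pairs: int):
    """Return (pair index, row on each disc, (disc_a, disc_b)) for a block.

    Assumed layout: pair p is made of discs 2p and 2p + 1.
    """
    if logical_block < 0:
        raise ValueError("logical_block must not be negative")
    if mirror_pairs < 1:
        raise ValueError("need at least one mirrored pair")
    pair = logical_block % mirror_pairs   # RAID-0 part: round-robin across pairs
    row = logical_block // mirror_pairs   # position of the block on each disc
    discs = (2 * pair, 2 * pair + 1)      # RAID-1 part: both discs hold a copy
    return pair, row, discs


# Six discs arranged as three mirrored pairs.
for block in range(6):
    print(block, raid10_locations(block, mirror_pairs=3))
```

Consecutive blocks land on different pairs, which is where the striping performance comes from, while every block exists on two discs, which is where the redundancy comes from.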
A closer look at storage arrays
Of course, there are many ways to group hard discs, and enterprise
storage can easily involve dozens to hundreds of discs arranged
into storage arrays. The very largest arrays can store hundreds of
terabytes (TB), or even
petabytes, of data. The most basic expression of disc grouping
is
JBOD. This is simply the accumulation of pure capacity, and
doesn't offer any redundancy or performance benefits. For example,
putting five 200 Gbyte drives in a JBOD arrangement simply yields 1
TB of unprotected storage.
As you saw above, RAID arrays group relatively small sets of
discs to work cooperatively for redundancy or added performance --
often both. However, redundancy costs drive space. Suppose there
are ten 200 Gbyte drives. That's 2 TB of raw storage, but
mirroring will cut that total in half to 1 TB of mirrored
storage. Parity-based RAID configurations like RAID-5 and RAID-6 can
ease the need for redundant disc space by storing parity information
spread across the disc group instead of full copies. The parity data is
then used to rebuild the data on a failed drive.
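The space cost of redundancy can be worked out with simple arithmetic. The Python sketch below is a simplified model, not a vendor sizing tool: it assumes identical drives, two-way mirroring for RAID-1, one drive's worth of parity for RAID-5 and two drives' worth for RAID-6, and it ignores hot spares and formatting overhead. The function name and the level labels are chosen for the example.

```python
def usable_capacity_gb(level: str, drives: int, drive_gb: float) -> float:
    """Estimate usable capacity for a group of identical drives."""
    if drives < 1 or drive_gb <= 0:
        raise ValueError("need at least one drive with positive capacity")
    raw = drives * drive_gb
    if level in ("JBOD", "RAID-0"):
        return raw                      # no redundancy, all space is usable
    if level == "RAID-1":
        return raw / 2                  # assumes every disc has one mirror
    if level == "RAID-5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least three drives")
        return (drives - 1) * drive_gb  # one drive's worth of parity
    if level == "RAID-6":
        if drives < 4:
            raise ValueError("RAID-6 needs at least four drives")
        return (drives - 2) * drive_gb  # two drives' worth of parity
    raise ValueError(f"unknown level: {level}")


# Ten 200 Gbyte drives, as in the example above.
for level in ("JBOD", "RAID-0", "RAID-1", "RAID-5", "RAID-6"):
    print(f"{level:>6}: {usable_capacity_gb(level, 10, 200):.0f} Gbytes usable")
```

Under these assumptions, ten 200 Gbyte drives yield 1,000 Gbytes when mirrored, but 1,800 Gbytes with RAID-5 and 1,600 Gbytes with RAID-6.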
Storage arrays can also be classified as modular or monolithic.
A modular storage array, like EMC Corp.'s Clariion AX100, is
typically small and self-contained with fewer than 24 drives,
designed for the lighter traffic patterns found in small and mid-sized
organizations. New modular arrays can be acquired to keep pace with
growing storage needs. In contrast, a monolithic storage array,
such as EMC's Symmetrix, Hitachi Data Systems Inc.'s Lightning and
IBM's DS8000, can be dramatically larger, hosting hundreds of
drives with the communication capability to handle heavy
utilization. The expense and management overhead needed for
monolithic arrays usually result in just a few key deployments. The
line between modular and monolithic arrays is blurring, however, as
features found in high-end arrays frequently appear in
smaller, lower-end systems.
Clustering is a relatively new concept in storage. Storage
clusters are basically groups of storage arrays sharing redundant
connections to work cooperatively as a single storage system. The
use of multiple arrays can service storage requests very quickly,
resulting in superior performance while supporting large numbers of
users. There is also inherent redundancy -- when one element of the
cluster fails, the other elements take over without interruption to
ensure that data is continuously
available. Storage clusters are generally deployed where top
performance and storage system uptime are most crucial.
Getting storage on the network
Of course, storage is useless unless network users can access
it. There are two principal means of attaching storage systems:
NAS and
SAN. NAS boxes are storage devices behind an Ethernet
interface, effectively connecting discs to the network through a
single
IP address. NAS deployments are typically straightforward and
management is light, so new NAS devices can easily be added as more
storage is needed. The downside to NAS is performance -- storage
traffic must compete for NAS access across the Ethernet cable. But
NAS access is often superior to disc access at a local server.
The SAN overcomes common server and NAS performance limitations
by creating a subnetwork of storage devices interconnected through
a switched fabric like FC or
iSCSI (called Internet SCSI or SCSI-over-IP). Both FC and iSCSI
approaches make any storage device visible from any host, and offer
much more availability for corporate data. FC is costlier, but
offers optimum performance, while iSCSI is cheaper, but somewhat
slower. Consequently, FC is found in the enterprise and iSCSI
commonly appears in small and mid-sized businesses. However, SAN
deployments are more costly to implement (in terms of switches,
cabling and host bus adapters) and demand far more management
effort.