For many years, the dominant protocols to access shared storage have been block and file.
Block-based access provides the ability, as the name suggests, to update individual blocks of data on a LUN or volume, with granularity as low as 512 bytes. File-based protocols access data at the file level; a file is locked for access as a whole, although the protocol (SMB or NFS) may allow sub-file updates.
File and block are great for certain classes of data. Block-based access works well with applications such as databases, whereas file protocols work well with files stored in a hierarchical structure.
But other data storage access requirements have arisen, at scale and in the cloud.
Here new de facto protocols are emerging, and a key one is the S3 API, Amazon's web services interface to its Simple Storage Service, a highly scalable public cloud storage service that stores objects rather than blocks or files.
An object is simply a piece of data in no specific format; it could be a file, an image, a piece of seismic data or some other kind of unstructured content. Object storage typically stores data with metadata that identifies and describes the content, and S3 is no exception.
S3 was first made available by Amazon in 2006. Today the system stores tens of trillions of objects. A single object can range from a few kilobytes up to 5TB in size, and objects are arranged into collections called buckets.
Outside of the bucket structure (which is there to provide admin and security multi-tenancy), the operation of S3 is a flat structure with no equivalent of the file structure hierarchy seen with NFS- and CIFS/SMB-based storage.
Objects are stored in and retrieved from S3 using a simple set of commands: PUT to store new objects, GET to retrieve them and DELETE to remove an object from the store. Updates are simply PUT requests that either overwrite an existing object or, if versioning is enabled, create a new version of it.
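These semantics can be illustrated with a minimal in-memory sketch. The `Bucket` class below is purely hypothetical (it is not the AWS SDK); it only models how PUT, GET and DELETE behave, including versioned overwrites:

```python
# Minimal in-memory sketch of S3's PUT/GET/DELETE semantics with
# optional versioning. Hypothetical class, not the real AWS service.

class Bucket:
    def __init__(self, versioning=False):
        self.versioning = versioning
        self._objects = {}  # key -> list of versions, newest last

    def put(self, key, data):
        """PUT overwrites the object or, with versioning on, adds a version."""
        if self.versioning:
            self._objects.setdefault(key, []).append(data)
        else:
            self._objects[key] = [data]

    def get(self, key, version=None):
        """GET returns the latest version unless an older one is requested."""
        versions = self._objects[key]
        return versions[-1] if version is None else versions[version]

    def delete(self, key):
        """DELETE removes the object (all versions, in this sketch)."""
        del self._objects[key]

bucket = Bucket(versioning=True)
bucket.put("report.pdf", b"v1")
bucket.put("report.pdf", b"v2")               # overwrite -> new version
assert bucket.get("report.pdf") == b"v2"      # latest wins
assert bucket.get("report.pdf", version=0) == b"v1"
bucket.delete("report.pdf")
```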
In S3, objects are referenced by a unique name, chosen by the user. This could be, for example, the name of a file or simply a random series of characters. Other object platforms do not let the user specify the object name, instead returning a system-generated object reference; S3's flexibility here makes it easier to use.
We have discussed the commands that are used to access S3, but how exactly are these commands executed?
S3 is accessed using web-based protocols that use standard HTTP(S) and a REST-based API.
REST, or Representational State Transfer, is an architectural style that provides a simple, scalable and reliable way of talking to web-based applications. REST is also stateless: each request is self-contained and doesn't require the session tracking via cookies or other methods employed by more complex web-based applications.
With S3, PUT, GET, COPY, DELETE and LIST commands can be coded natively as HTTP requests in which the header of the HTTP call carries the details of the request and the body carries the object content itself. More practically, though, S3 can be accessed using a number of software development kits for languages that include Java, .NET, PHP and Ruby.
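The shape of such a native request can be sketched by building the HTTP text directly. This is illustrative only: the bucket name is made up, and a real S3 request would also need a dated, signed Authorization header (AWS Signature Version 4), which is exactly the boilerplate the SDKs handle for you:

```python
# Sketch of an S3 PUT as raw HTTP text. Illustrative only: a real
# request also requires AWS Signature v4 authentication headers.

def build_put_request(bucket: str, key: str, body: bytes) -> str:
    """Build the textual HTTP request for PUT /<key> on a bucket's
    virtual-hosted endpoint. Headers describe the request; the body
    carries the object content itself."""
    lines = [
        f"PUT /{key} HTTP/1.1",
        f"Host: {bucket}.s3.amazonaws.com",
        f"Content-Length: {len(body)}",
        "Content-Type: application/octet-stream",
        "",                      # blank line separates headers from body
        body.decode("utf-8"),
    ]
    return "\r\n".join(lines)

req = build_put_request("my-bucket", "hello.txt", b"hello world")
```

A GET or DELETE for the same object differs only in the request method and the absence of a body.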
Storage tiers in Amazon S3
There are three levels of storage tier available from Amazon, each of which attracts a different price. These are:
- Standard: General S3 capacity, used as the usual end point for data added to S3.
- Standard (Infrequent Access): A version of S3 capacity with lower levels of availability than Standard for data that doesn’t need to be highly available.
- Glacier: Long-term archive storage.
Each storage tier is priced differently. For example, in the EU, entry-level capacity for Standard is $0.03/GB per month, Standard Infrequent Access is $0.0125/GB and Glacier is $0.007/GB. There is also a charge per number of requests made and for the volume of data read from S3. There is no charge for data written into the S3 service.
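A quick calculation shows how the tiers compare for pure storage cost. The figures below are the entry-level EU prices quoted above; request and data-transfer charges are deliberately excluded, so treat this as a rough sketch rather than a bill estimate:

```python
# Rough monthly storage cost for 500 GB at each tier, using the
# entry-level EU prices quoted in the article. Request and egress
# charges are excluded.

PRICE_PER_GB_MONTH = {
    "standard": 0.03,
    "standard_ia": 0.0125,
    "glacier": 0.007,
}

def monthly_storage_cost(gb: float, tier: str) -> float:
    """Storage-only cost in USD for one month at the given tier."""
    return gb * PRICE_PER_GB_MONTH[tier]

for tier in PRICE_PER_GB_MONTH:
    print(f"{tier}: ${monthly_storage_cost(500, tier):.2f}")
# standard: $15.00, standard_ia: $6.25, glacier: $3.50
```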
S3 storage behind the scenes
Amazon does not provide any technical details on how S3 is implemented, but we do have knowledge of some technical points that help us understand the way S3 operates.
Amazon Web Services (of which S3 is only one service) operates from 12 geographic regions around the world, with new locations announced every year. These regions are divided into Availability Zones that consist of one or more datacentres – currently 33 in total. Availability Zones provide data resiliency: S3 data is redundantly distributed across multiple zones within a region, with multiple copies of the data in each zone.
In terms of availability and resiliency, Amazon quotes two figures. The first is availability: data is guaranteed to be 99.99% available for the Standard tier and 99.9% for Standard Infrequent Access.
An availability figure does not apply to Glacier, as retrieval of data from the system is asynchronous and can take up to four hours. The second figure is durability, which indicates the risk of losing data within S3. All three storage tiers offer durability of 99.999999999%.
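It is worth translating that durability figure into expected losses, since a string of nines is hard to reason about. The arithmetic below treats the figure as an annual per-object survival probability (a common reading of Amazon's claim, though the source here doesn't spell out the exact model):

```python
# What 99.999999999% ("11 nines") durability implies: expected annual
# object loss for a given object count, reading the figure as a
# per-object annual survival probability (an assumption, not a spec).

DURABILITY = 0.99999999999
annual_loss_prob = 1 - DURABILITY          # ~1e-11 per object per year

def expected_losses_per_year(num_objects: int) -> float:
    """Expected number of objects lost per year across the whole store."""
    return num_objects * annual_loss_prob

# With 10 million stored objects, the expectation is 0.0001 losses per
# year -- roughly one lost object every 10,000 years on average.
losses = expected_losses_per_year(10_000_000)
```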
Using S3 for your application
S3 provides essentially unlimited storage capacity without the need to deploy lots of on-premises infrastructure to manage it. However, there are a few considerations and challenges when using S3 rather than in-house object storage:
- Eventual consistency: S3 uses an eventual consistency model for updates and deletes to existing objects. This means that if an existing object is overwritten, a re-read of that object may return a previous version, because replication of the object between availability zones in the same region may not have completed. Additional programming is needed to check for this scenario.
- Compliance: Data in S3 is stored in a limited number of countries and this may cause an issue for compliance and regulatory restrictions in some verticals. Currently there is no UK region, for example, although this is planned.
- Security: The security of data in any public cloud service is always a concern. S3 offers multiple levels of security that include use of S3 keys, S3 managed keys or customer-supplied encryption keys. Using customer-supplied keys means the customer has to put in place their own key management regime, because loss of those keys would effectively render all stored data unreadable.
- Locking: S3 provides no capability to serialise access to data. The user application is responsible for ensuring that multiple PUT requests for the same object do not clash with each other. This requires additional programming in environments that have frequent object updates (for example, a read, modify, write process).
- Cost: The cost of using S3 can be significant when data access requirements are taken into consideration. Any data read out of S3 attracts a charge (although this is not the case if S3 is accessed by other AWS services, such as EC2). Also, customers may have to invest in additional network capacity to reduce the risk of bottlenecks between their own datacentre and those of AWS, depending how applications access S3 storage.
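The extra programming the eventual-consistency point calls for typically amounts to a read-after-write check with retries. The sketch below simulates it: `EventuallyConsistentStore` is a hypothetical stand-in that serves a stale copy until "replication" completes, and `read_expected` is the kind of retry loop an application would run (against a real GET) after overwriting an object:

```python
# Sketch of a read-after-overwrite check for an eventually consistent
# store. EventuallyConsistentStore is a hypothetical simulation; a real
# client would re-issue a GET against S3 instead.

import time

class EventuallyConsistentStore:
    def __init__(self, stale, fresh, stale_reads=2):
        self._stale, self._fresh = stale, fresh
        self._remaining_stale = stale_reads   # reads before "replication" completes

    def get(self, key):
        if self._remaining_stale > 0:
            self._remaining_stale -= 1
            return self._stale                # previous version still visible
        return self._fresh

def read_expected(store, key, expected, retries=5, delay=0.0):
    """Re-read until the expected (new) version appears or retries run out."""
    for _ in range(retries):
        data = store.get(key)
        if data == expected:
            return data
        time.sleep(delay)                     # back off before re-reading
    raise TimeoutError("still seeing a previous version of the object")

store = EventuallyConsistentStore(stale=b"v1", fresh=b"v2")
assert read_expected(store, "report.pdf", b"v2") == b"v2"
```

In production code the delay would be non-zero and typically grow between attempts (exponential backoff), and the caller would decide what to do if the new version never appears.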
Amazon is in a strong position with S3, and most object storage software suppliers have chosen to adopt the S3 API as an unofficial de facto standard. This allows applications to switch between on-premises and cloud-based storage with little or no modification. Many of these suppliers hope they can provide added value on top of S3 compatibility and stay competitive in the marketplace.
Object storage is a rapidly growing part of the IT industry and as end-users get to grips with a slightly different programming paradigm, we are likely to see significant growth in this part of the storage ecosystem.