For many years, the dominant protocols to access shared storage have been block and file. Block-based access provides the ability, as the name suggests, to update individual blocks of data on a logical unit number (LUN) or volume, with granularity as low as 512 bytes. File-based protocols access data at the file level; typically an entire file is locked while it is being accessed, although the protocol, whether server message block (SMB) or network file system (NFS), may allow sub-file updates.
File and block are great for certain classes of data. Block-based access works well with applications such as databases, while file protocols work well with files stored in a hierarchical structure.
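But other data storage access requirements have arisen, at scale and in the cloud. These are increasingly met by object storage, of which Amazon's Simple Storage Service (S3) is a prominent example.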
An object is simply a piece of data in no specific format: it could be a file, an image, a piece of seismic data or some other kind of unstructured content. Object storage typically stores data, along with metadata that identifies and describes the content. S3 is no exception to this.
S3 was first made available by Amazon in 2006. Today, the system stores tens of trillions of objects. A single object can range from a few kilobytes up to 5TB in size, and objects are arranged into collections called “buckets”.
Outside of the bucket structure, which is there to provide admin and security multitenancy, S3 is a flat structure with no equivalent of the file hierarchy seen in storage based on NFS, common internet file system (CIFS) or SMB.
Objects are stored in and retrieved from S3 using a simple set of commands: PUT to store new objects, GET to retrieve them and DELETE to delete them from the store. Updates are simply PUT requests that either overwrite an existing object or provide a new version of the object, if versioning is enabled.
In S3, objects are referenced by a unique name, chosen by the user. This could be, for example, the name of a file or simply a random series of characters. Other object platforms do not give the user the ability to specify the object name, instead returning an object reference. S3 is more flexible in this way and it makes it easier to use.
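The semantics described above can be illustrated with a toy in-memory model. This is only a sketch for reasoning about behaviour, not how S3 is implemented: the class name `ToyBucket` and its methods are hypothetical, and deleting a key here simply removes it, which is a simplification. It shows user-chosen names in a flat namespace, PUT as overwrite or new version, and how a "folder" is nothing more than a shared key prefix.

```python
from typing import Dict, List, Optional


class ToyBucket:
    """Hypothetical in-memory stand-in for a bucket (illustration only)."""

    def __init__(self, versioning: bool = False) -> None:
        self.versioning = versioning
        # Flat mapping: object name -> list of stored versions (oldest first).
        self._objects: Dict[str, List[bytes]] = {}

    def put(self, key: str, body: bytes) -> None:
        versions = self._objects.setdefault(key, [])
        if self.versioning:
            versions.append(body)   # keep history, newest last
        else:
            versions[:] = [body]    # plain overwrite

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        versions = self._objects[key]
        return versions[-1] if version is None else versions[version]

    def delete(self, key: str) -> None:
        # Simplification: drop the key and every version of it.
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = "") -> List[str]:
        # No directories: a "folder" is just a common prefix of the names.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket()
bucket.put("photos/2016/cat.jpg", b"first upload")
bucket.put("photos/2016/cat.jpg", b"second upload")  # overwrites the object
bucket.put("notes.txt", b"hello")
print(bucket.get("photos/2016/cat.jpg"))
print(bucket.list_keys("photos/"))
```

With `ToyBucket(versioning=True)`, the second PUT would be kept alongside the first rather than replacing it.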
How exactly are these commands executed?
S3 is accessed using web-based protocols that use standard HTTP(S) and a REST-based application programming interface (API).
Representational state transfer (REST) is an architectural style, layered on HTTP, that provides a simple, scalable and reliable way of talking to web-based applications. REST is also stateless, so each request is self-contained and doesn’t require tracking using cookies or other methods employed by complex web-based applications.
With S3, PUT, GET, COPY, DELETE and LIST commands can be coded natively as HTTP requests in which the header of the HTTP call indicates the details of the request and the body of the call is used for the object content itself. More practically, though, S3 can be accessed using a number of software development kits (SDKs) for languages including Java, .NET, PHP and Ruby.
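To make the shape of such a call concrete, the sketch below builds (but does not send) a PUT and a GET request using only the Python standard library. The bucket name `example-bucket` and the helper names are hypothetical. Authentication is deliberately left out: a real request to a private bucket must also carry a signed Authorization header, which is one of the main reasons most developers use an SDK instead.

```python
from urllib.parse import quote
from urllib.request import Request


def object_url(bucket: str, key: str) -> str:
    # Virtual-hosted-style address: the bucket is part of the host name and
    # the object name is the path. Slashes in the name are left as-is.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def build_put_request(bucket: str, key: str, body: bytes,
                      content_type: str = "application/octet-stream") -> Request:
    # Headers describe the request; the body carries the object content.
    return Request(object_url(bucket, key), data=body, method="PUT",
                   headers={"Content-Type": content_type})


def build_get_request(bucket: str, key: str) -> Request:
    return Request(object_url(bucket, key), method="GET")


put_req = build_put_request("example-bucket", "reports/2016/q1.csv",
                            b"region,sales\n", content_type="text/csv")
print(put_req.get_method(), put_req.full_url)
```

Nothing leaves the machine here; passing the request to `urllib.request.urlopen` would send it, and an unsigned request would normally be rejected unless the bucket allows anonymous access.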
Storage tiers in Amazon S3
There are three levels of storage tier available from Amazon, each of which attracts a different price. These are:
- Standard: General S3 capacity, used as the usual end point for data added to S3.
- Standard (Infrequent Access): A version of S3 capacity with lower levels of availability than Standard for data that doesn’t need to be highly available.
- Glacier: Long-term archive storage.
For example, in the European Union (EU), entry-level capacity costs $0.03/GB for Standard, $0.0125/GB for Standard Infrequent Access and $0.007/GB for Glacier. There is also a charge for each request made and for the volume of data read from S3. There is no charge for data written into the S3 service.
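As a rough worked example, the quoted prices can be turned into a capacity bill. The sketch assumes the prices are per gigabyte per month and apply as a flat rate, which ignores volume bands as well as request and data-read charges; the function and dictionary names are illustrative.

```python
# Entry-level EU prices quoted above, assumed to be in US dollars per GB per month.
PRICE_PER_GB = {
    "standard": 0.03,
    "standard_ia": 0.0125,
    "glacier": 0.007,
}


def monthly_storage_cost(gigabytes: float, tier: str) -> float:
    """Capacity charge only: request and data-read charges are excluded."""
    return gigabytes * PRICE_PER_GB[tier]


# Compare the three tiers for 1TB (taken here as 1,000GB) of stored data.
for tier in PRICE_PER_GB:
    print(f"{tier}: ${monthly_storage_cost(1000, tier):.2f} per month")
```

The gap between tiers narrows once retrieval is factored in, so the comparison only holds for data that is rarely read.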
S3 storage behind the scenes
Amazon does not provide any technical details on how S3 is implemented, but we do have knowledge of some technical points that help us understand the way S3 operates.
Amazon Web Services (AWS), of which S3 is only one service, operates from 12 geographic regions around the world, with new locations announced every year. These regions are divided into availability zones, currently 33 in total, each consisting of one or more datacentres. Availability zones provide data resiliency: S3 data is redundantly distributed across multiple zones, with multiple copies of data in each zone.
In terms of availability and resiliency, Amazon quotes two figures.
The first is availability, which is quoted as 99.99% for the Standard tier and 99.9% for Standard Infrequent Access. Availability does not apply to Glacier, as the retrieval of data from the system is asynchronous and can take up to four hours.
The second figure is for durability. This gives an indication of the risk of losing data within S3. All three storage tiers offer durability of 99.999999999%.
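Both figures are easier to judge as concrete numbers. The sketch below assumes each percentage is an annual figure, that the durability figure applies independently to every object, and that a year has 365 days; the function names are illustrative. `Decimal` is used because a value this close to 100% loses precision in ordinary floating point.

```python
from decimal import Decimal


def annual_loss_probability(durability_percent: str) -> Decimal:
    """Chance of losing any one object in a year, given durability in percent."""
    return Decimal(1) - Decimal(durability_percent) / Decimal(100)


def years_per_lost_object(object_count: int, durability_percent: str) -> Decimal:
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * annual_loss_probability(durability_percent)
    return Decimal(1) / expected_losses_per_year


def max_downtime_minutes_per_year(availability_percent: str) -> Decimal:
    """Unavailability implied by an availability figure, in minutes per year."""
    minutes_per_year = Decimal(365 * 24 * 60)
    return minutes_per_year * (Decimal(1) - Decimal(availability_percent) / Decimal(100))


print(years_per_lost_object(10_000_000, "99.999999999"))  # years between losses
print(max_downtime_minutes_per_year("99.99"))             # Standard
print(max_downtime_minutes_per_year("99.9"))              # Standard Infrequent Access
```

Under these assumptions, ten million stored objects would see one loss roughly every 10,000 years, while the two availability figures correspond to about 53 minutes and just under nine hours of unavailability a year respectively.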
Using S3 for your application
S3 provides essentially unlimited storage capacity without the need to deploy lots of on-premise infrastructure to manage it. However, there are a few considerations and challenges when using S3 rather than in-house object storage:
- Eventual consistency: S3 uses a data consistency model of eventual consistency for updates or deletes to existing objects. This means that if an existing object is overwritten, there is a chance a re-read of that object may return a previous version, because replication of the object between availability zones in the same region may not yet have completed. Additional programming is needed to check for this scenario.
- Compliance: Data in S3 is stored in a limited number of countries, which may cause an issue for compliance and regulatory restrictions in some verticals. Currently, there is no UK region, for example, although one is planned.
- Security: The security of data in any public cloud service is always a concern. S3 offers multiple levels of security that include use of S3 keys, S3 managed keys or customer-supplied encryption keys. Obviously, using keys from the customer means the customer has to put in place their own key management regime because loss of those keys would effectively render all data stored useless.
- Locking: S3 provides no capability to serialise access to data. The user application is responsible for ensuring that multiple PUT requests for the same object do not clash with each other. This requires additional programming in environments that have frequent object updates (for example, a “read, modify, write” process).
- Cost: The cost of using S3 can be significant when data access requirements are taken into consideration. Any data read out of S3 attracts a charge, although this is not the case if S3 is accessed by other web services from Amazon, such as Elastic Compute Cloud (EC2). Also, customers may have to invest in additional network capacity to reduce the risk of bottlenecks between their own datacentre and those of AWS, depending on how applications access S3 storage.
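The "additional programming" mentioned under eventual consistency and locking might look something like the following sketch. Everything here is hypothetical: `ToyStore` is an in-memory stand-in that deliberately returns the previous version on the first read after an overwrite, and `SerialisedUpdater` is one possible approach among several. It remembers a digest of what was last written, re-reads until the content matches, and wraps each "read, modify, write" cycle in a lock.

```python
import hashlib
import threading
import time
from typing import Callable, Dict


def digest(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()


class ToyStore:
    """In-memory stand-in: the first read after an overwrite is stale."""

    def __init__(self) -> None:
        self._current: Dict[str, bytes] = {}
        self._stale: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        if key in self._current:
            self._stale[key] = self._current[key]  # old copy lingers
        self._current[key] = body

    def get(self, key: str) -> bytes:
        if key in self._stale:
            return self._stale.pop(key)  # simulate replication lag once
        return self._current[key]


class SerialisedUpdater:
    """Serialises updates and rejects stale reads, within one process only."""

    def __init__(self, store: ToyStore, attempts: int = 5, delay: float = 0.0) -> None:
        self._store = store
        self._attempts = attempts
        self._delay = delay
        self._lock = threading.Lock()
        self._last_written: Dict[str, str] = {}

    def _put(self, key: str, body: bytes) -> None:
        self._store.put(key, body)
        self._last_written[key] = digest(body)

    def _read_fresh(self, key: str) -> bytes:
        # Re-read until the content matches what this process last wrote.
        for _ in range(self._attempts):
            body = self._store.get(key)
            if digest(body) == self._last_written[key]:
                return body
            time.sleep(self._delay)
        raise TimeoutError(f"{key!r} still stale after {self._attempts} reads")

    def write(self, key: str, body: bytes) -> None:
        with self._lock:
            self._put(key, body)

    def update(self, key: str, modify: Callable[[bytes], bytes]) -> bytes:
        # The lock covers the whole read, modify, write cycle.
        with self._lock:
            new_body = modify(self._read_fresh(key))
            self._put(key, new_body)
            return new_body


updater = SerialisedUpdater(ToyStore())
updater.write("counter", b"0")
print(updater.update("counter", lambda b: str(int(b) + 1).encode()))
```

A lock of this kind only protects writers inside a single process. Where several servers update the same objects, the same idea needs an external coordinator, such as a database row or a lock service, which is exactly the extra work the list above warns about.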
Amazon is in a strong position with S3, and most object storage software suppliers have chosen to adopt the S3 API as a de facto standard. This allows applications to be moved between on-premise and cloud-based storage with little or no modification. Many of these suppliers hope they can add value on top of S3 and stay competitive in the market.
Object storage is a rapidly growing part of the IT industry and, as users get to grips with a slightly different programming paradigm, we are likely to see significant growth in this part of the storage ecosystem.