Like traditional hardware and software, cloud services are susceptible to network outages. This also applies to AWS, the market leader of public cloud providers.
As Werner Vogels, CTO of Amazon.com, observes: "Everything fails, all the time." There is no silver bullet for increasing the resilience of an AWS application, but there is a set of good practices you can consider and follow.
AWS resources are organised into regions. All regions provide (more or less) the same set of services; currently, there are more than 30 services that can be used in AWS. These services are "highly available" by default, so there is no need to give them special consideration unless availability and data are needed across regions. A region comprises two or more availability zones, each of which consists of one or more distinct datacentres.
First of all, to increase reliability, single points of failure should be avoided. In practice, this means applications should continue to function if the underlying physical hardware fails or is removed.
In the case of a relational database, this could mean creating a secondary database and replicating the data. If the main database server goes offline, the secondary server can pick up the load.
Needless to say, it is always a good idea to place these server systems in different availability zones, so if one fails, the server in the other zone can take over the load.
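With Amazon RDS, for example, such a standby in another availability zone can be requested with a single flag. A minimal sketch of the request parameters is shown below; the parameter names echo the RDS CreateDBInstance API, while all values (identifiers, sizes, credentials) are hypothetical placeholders:

```python
# Hypothetical parameters for creating a Multi-AZ relational database.
# Parameter names follow the RDS CreateDBInstance API; all values are
# illustrative placeholders, not a working configuration.
db_params = {
    "DBInstanceIdentifier": "orders-db",   # made-up instance name
    "Engine": "mysql",
    "DBInstanceClass": "db.m1.small",
    "AllocatedStorage": 20,                # GB
    "MasterUsername": "admin",
    "MasterUserPassword": "change-me",     # placeholder only
    # The key setting: RDS provisions a standby replica in a different
    # availability zone and fails over to it if the primary goes offline.
    "MultiAZ": True,
}
```

A dictionary like this could then be passed to an SDK call such as boto's `create_db_instance`; the essential point is that the standby and failover are handled by the service once `MultiAZ` is enabled.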
One of the most important rules is to make your system loosely coupled. Components in such loosely coupled systems operate without knowing the details of other components; essentially, each component is a black box to the others.
Loosely coupled systems have two great advantages: first, they allow systems to scale. A common way is to use a load balancer to decouple different systems and balance requests across different systems. Second, reliability is increased.
Amazon provides different services to decouple systems and make them more reliable. One of the first such services was Simple Queue Service (SQS). Amazon describes SQS as a distributed queue system that enables applications to quickly and reliably queue messages that one component generates to be consumed by another. Later, other services such as Simple Notification Service (SNS) and Simple Workflow Service (SWF) followed.
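The decoupling idea can be illustrated with Python's standard library queue as a local stand-in for SQS: producer and consumer share only the queue, never each other's details. The SQS equivalents noted in the comments are the real API operations; everything else is a local sketch.

```python
import queue
import threading

# Local stand-in for SQS: a thread-safe queue decouples the producer from
# the consumer. Neither component knows anything about the other.
messages = queue.Queue()
results = []

def producer():
    for i in range(3):
        messages.put(f"order-{i}")       # with SQS: send_message(...)

def consumer():
    for _ in range(3):
        results.append(messages.get())   # with SQS: receive_message(...)
        messages.task_done()

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

One caveat: unlike this in-memory sketch, a standard SQS queue does not guarantee message ordering, which is precisely why consumers should treat each message independently.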
One of the main characteristics of the cloud is elasticity, which means that no component should make assumptions about the health, availability or fixed location of other components.
To implement elasticity, bootstrapping is needed. Bootstrapping allows machines to be configured dynamically at boot time by assigning them roles as they come online. It can involve installing the latest data, registering a service with DNS, updating packages or mounting devices.
There are different ways to bootstrap instances. These include bash and PowerShell scripts, as well as configuration tools such as Chef and Puppet. It is also possible to pass information to an EC2 instance, for example its role or a script to be executed.
At runtime, EC2 instances can query their local instance metadata to obtain this information. Most EC2 images also include cloud-init, which executes the passed user data on first boot.
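On a real instance, user data is fetched from the link-local metadata service (for example from http://169.254.169.254/latest/user-data). The sketch below skips the HTTP call and just parses a hypothetical key=value user-data payload into a configuration dict a bootstrap script could act on; the payload format and keys are made up for illustration.

```python
# Parse a hypothetical "key=value" user-data payload, of the kind that
# could be assigned to an EC2 instance at launch. On a real instance the
# raw string would come from the local metadata service.
def parse_user_data(raw: str) -> dict:
    """Turn newline-separated key=value pairs into a config dict."""
    config = {}
    for line in raw.strip().splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config

user_data = "role=webserver\nenvironment=staging"
config = parse_user_data(user_data)
# config["role"] now tells the bootstrap logic which role to assume,
# e.g. which packages to install or which service to register with DNS.
```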
With CloudFormation, parts of the infrastructure (or even all of it) can be written entirely in code. A CloudFormation template is a set of instructions that describes not only how to boot up an EC2 instance, but entire application stacks.
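A minimal template can be sketched as a Python dict and serialised to the JSON that CloudFormation accepts. The resource type (`AWS::EC2::Instance`) and top-level keys are CloudFormation's real grammar; the AMI id and instance type are placeholders:

```python
import json

# A minimal CloudFormation template built as a Python dict. The resource
# type and top-level keys follow CloudFormation's template format; the
# AMI id is a placeholder, not a real image.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Single EC2 instance (illustrative only)",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-12345678",   # placeholder AMI id
                "InstanceType": "t1.micro",
            },
        }
    },
}

template_json = json.dumps(template, indent=2)
```

A real stack would add further resources (load balancers, security groups, databases) to the same `Resources` map, which is how a template grows from one instance into an entire application stack.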
Outages can also happen due to security breaches. The best way to minimise such risk is to build security in. Data should always be encrypted, whether it is in transit or at rest. The principle of least privilege should be enforced.
Furthermore, security groups should be used in every layer and last but not least, multi-factor authentication is always highly recommended. Special consideration should also be given to the Master Account.
While there is a Master Account, the recommendation is not to use it; instead, AWS Identity and Access Management (IAM) should be used to create users and groups. The use of a physical MFA device for Management Console logins is also recommended.
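The principle of least privilege mentioned above translates directly into narrow IAM policies. The sketch below uses IAM's real policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`), but the bucket name is hypothetical; the point is that the user can read one bucket and nothing else.

```python
# A least-privilege IAM policy expressed as a Python dict: it grants
# read access to a single (hypothetical) S3 bucket and nothing else.
# The policy grammar itself is IAM's real JSON format.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",     # the bucket itself
                "arn:aws:s3:::example-bucket/*",   # the objects in it
            ],
        }
    ],
}
```

Because no statement grants anything beyond these two S3 actions, everything else is implicitly denied, which is exactly the behaviour least privilege asks for.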
Another benefit of the cloud is the ease of scalability. There is no need to buy extra hardware; instead you can easily provision additional instances. If there are scalability problems, consider distributing load across machines. Even if hardware fails, it is easy to simply replace the existing instance.
At runtime, if traffic increases, it is easy to obtain more capacity and, once the capacity is no longer needed, it can easily be released. AWS has two important services that help with scalability: the first is Elastic Load Balancer (ELB) and the second is Auto Scaling.
ELBs can be used in different scenarios. First, they can act as an external load balancer (mainly to increase scalability) and, second, they can act as an internal load balancer (to provide a loosely coupled system). ELBs can be reached via their hostname or, with Route 53, if records should resolve directly to IP addresses.
Traffic can be evenly distributed across one or more availability zones. Even within an availability zone, traffic is evenly distributed over its instances.
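The even-distribution idea can be sketched with a simple round-robin assignment, the same basic scheme a load balancer applies first across zones and then across the instances within each zone. The instance names here are hypothetical:

```python
import itertools

# Round-robin distribution of requests over instances spread across two
# (hypothetical) availability zones. Each instance receives an equal
# share of the traffic.
instances = ["i-az1-a", "i-az1-b", "i-az2-a", "i-az2-b"]
targets = itertools.cycle(instances)

requests = [f"req-{n}" for n in range(8)]
assignment = {req: next(targets) for req in requests}
# Eight requests over four instances: two requests per instance.
```

Real ELBs use more sophisticated routing than a bare cycle, but the resilience property is the same: if one zone's instances disappear, the remaining ones still absorb the traffic.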
With Auto Scaling, it is possible to scale Amazon EC2 instances up and down automatically. Under the hood, Auto Scaling consists of four components: a launch configuration, which defines which instances should be started; an Auto Scaling group, which describes how the system should scale; a scaling policy, which configures the events that trigger scaling; and, optionally, schedule information specifying when scaling should take place.
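The four components can be sketched as plain dicts. The key names echo the Auto Scaling API; all values (names, AMI id, sizes, the cron expression) are made up for illustration:

```python
# The four Auto Scaling components as illustrative dicts. Key names echo
# the Auto Scaling API; every value is a hypothetical placeholder.

# 1. Launch configuration: which instances to start.
launch_configuration = {
    "LaunchConfigurationName": "web-lc",
    "ImageId": "ami-12345678",          # placeholder AMI
    "InstanceType": "t1.micro",
}

# 2. Auto Scaling group: how the system should scale, and where.
auto_scaling_group = {
    "AutoScalingGroupName": "web-asg",
    "LaunchConfigurationName": "web-lc",
    "MinSize": 2,
    "MaxSize": 6,
    "DesiredCapacity": 2,
    "AvailabilityZones": ["eu-west-1a", "eu-west-1b"],
}

# 3. Scaling policy: what to do when a scaling event fires.
scaling_policy = {
    "PolicyName": "scale-out-on-load",
    "AutoScalingGroupName": "web-asg",
    "AdjustmentType": "ChangeInCapacity",
    "ScalingAdjustment": 2,             # add two instances per trigger
}

# 4. Optional schedule: predictable load peaks (cron-style recurrence).
scheduled_action = {
    "ScheduledActionName": "weekday-morning",
    "Recurrence": "0 8 * * 1-5",
    "DesiredCapacity": 4,
}
```

Note that the group spans two availability zones and keeps at least two instances running, so the scaling machinery doubles as a resilience mechanism.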
Parallel architectures can be designed to increase the performance of an application. If designed properly, there is no additional cost, yet the work can be done in a fraction of the time normally needed.
In addition, there are different storage options in AWS. There is block storage, object storage, content delivery/edge caching, relational databases and NoSQL databases. It is crucial to find out the appropriate option for your system to increase resilience.
Block storage acts like a hard disk on a physical server and is ideal for OS boot devices, file systems or databases. However, while EBS volumes are optimised for throughput, they can fail from time to time (with an annual failure rate of 0.1% to 0.5%).
To increase durability, EBS snapshots can be stored in Amazon S3. These snapshots are incremental backups, meaning only the blocks that have changed since the last snapshot are saved. With snapshots, EBS volumes can also be migrated across regions, which further increases resilience.
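The incremental idea can be shown in miniature: treat a volume as a list of equal-sized blocks and store only the blocks that differ from the previous snapshot. This is a toy model of the concept, not EBS's actual snapshot mechanism:

```python
# Incremental snapshots in miniature: a volume modelled as a list of
# equal-sized blocks. Only blocks that changed since the last snapshot
# need to be stored. (A toy model of the concept, not the EBS internals.)
def changed_blocks(previous, current):
    """Return {block_index: new_data} for blocks that differ."""
    return {
        i: block
        for i, (old, block) in enumerate(zip(previous, current))
        if old != block
    }

snapshot_1 = ["aaa", "bbb", "ccc", "ddd"]   # state at the last snapshot
volume_now = ["aaa", "xxx", "ccc", "yyy"]   # current state of the volume

delta = changed_blocks(snapshot_1, volume_now)
# Only blocks 1 and 3 changed, so only those two need to be stored.
```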
If databases are used in AWS, there are means to increase reliability. Database mirroring can be used to maintain a hot standby instance. If data should be transferred between regions, replication can be used.
But even if a system is well designed for resilience and reliability, it is crucial to test it. One tool for this is Chaos Monkey. Originally developed at Netflix, Chaos Monkey deliberately introduces a variety of failures to verify that applications remain highly available.
Chaos Monkey can work in different modes: as a simple monkey, it can kill any instance in the account; as a complex monkey, it can kill instances with specific tags or introduce faults; and a human monkey can kill instances from the AWS Management Console. Hence, the tool can be used to check whether a system is resilient against faults.
Guido Soeldner is a cloud infrastructure and virtualisation specialist working at Soeldner Consult GmbH - a German-based consultancy firm. Soeldner is also a regular contributor to Computer Weekly.