Containerisation in the enterprise - Percona: Don't leave container costs to cluster luck

As businesses continue to modernise their server estate and move towards cloud-native architectures, the elephant in the room is the monolithic core business application that cannot easily be rehosted without significant risk and disruption.

These days, it is often more efficient to deploy an application in a container than to use a virtual machine. Computer Weekly examines the modern trends, dynamics and challenges faced by organisations now migrating to the micro-engineered world of software containerisation.

As all good software architects know, a container is defined as a ‘logical’ computing environment where code is engineered to allow a guest application to run in a state where it is abstracted away from the underlying host system’s hardware and software infrastructure resources.

So, what do enterprises need to think about when it comes to architecting, developing, deploying and maintaining software containers?

This post is written by Sergey Pronin in his role as product owner at open source database company Percona.

Pronin writes as follows…

More and more companies are adopting containers, as well as Kubernetes, to manage their implementations. Containers work well in the software development pipeline and make delivery easier. After a while, containerised applications move into production, Kubernetes takes care of the management side and everyone is happy.

But it’s at this point that many developers start to face unexpected growth in their cloud costs.

The software engineering teams have often completed their part by setting up auto-scalers in Kubernetes to make applications more available and resilient. However, the issue now is that cloud bills can start to snowball. It’s important to track container usage and spend levels to get a better picture of where your money is going.

Your first step should be getting the required data from your containers, and there are multiple exporters for Kubernetes that can provide these metrics. The most common tool used for this is Prometheus, an open source systems monitoring tool originally developed at SoundCloud. Prometheus scrapes the metrics those exporters expose and makes them available for querying and onward collection.
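As a rough illustration of pulling that data out, the sketch below builds a request for the Prometheus HTTP API's instant-query endpoint and flattens the JSON it returns into per-namespace numbers. The host name, the choice of query and the sample response are assumptions made for the example; the response is hand-written in the documented shape rather than taken from a real cluster.

```python
import json
from urllib.parse import urlencode

# CPU cores in use per namespace, averaged over the last five minutes.
# container_cpu_usage_seconds_total is the counter exposed by cAdvisor.
CPU_BY_NAMESPACE = "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: str) -> dict[str, float]:
    """Map each namespace label to its sample value in an instant-vector response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return {
        series["metric"].get("namespace", "<none>"): float(series["value"][1])
        for series in body["data"]["result"]
    }


# Hypothetical sample response; the namespaces and values are made up.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"namespace": "shop"}, "value": [1700000000, "1.5"]},
      {"metric": {"namespace": "billing"}, "value": [1700000000, "0.25"]}
    ]
  }
}
"""

if __name__ == "__main__":
    # prometheus.example is a placeholder host, not a real endpoint.
    print(build_query_url("http://prometheus.example:9090", CPU_BY_NAMESPACE))
    print(parse_instant_vector(SAMPLE_RESPONSE))
```

In practice the URL would be fetched with an HTTP client and the body passed to the parser; the two steps are kept separate here so the parsing can be tried without a running Prometheus.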

As you move into production, you will probably have multiple clusters in place.

Each cluster should have its own Prometheus installation to act as a data collector. This provides data on activity per cluster, which can then be gathered in one place for overall financial tracking and reporting. Free and open source tools are available for this, such as Grafana for data visualisation, alongside specific products that hold data centrally for management and for building dashboards.

Internally, we moved from Prometheus as our central data collection point to VictoriaMetrics, holding time series data in our open source monitoring tool [Ed – gratuitous plug permitted, this tool is open source] Percona Monitoring and Management. VictoriaMetrics requires far less disk space and RAM while still achieving the same performance levels as we saw with Prometheus as a database.

Cloud bills provisioned, not used

The most likely reason for your growing cloud computing bill is the cost of computing resources and storage.

Because a container implementation orchestrated by Kubernetes automatically scales up to meet demand, it also easily adds more nodes with compute resources linked to them. Similarly, public clouds bill for provisioned storage volumes rather than actual volumes used.

For example, an Amazon Elastic Block Store (EBS) user will pay for 10 TB of provisioned EBS volumes even if only 1 TB is really used.
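To put numbers on that example, here is a minimal sketch of the arithmetic. The price per GB-month is a made-up round figure chosen for illustration, not a quoted AWS rate, and 1 TB is treated as 1,000 GB.

```python
def monthly_storage_cost(provisioned_gb: float, price_per_gb_month: float) -> float:
    """Billing follows what is provisioned, regardless of what is used."""
    return provisioned_gb * price_per_gb_month


def wasted_spend(provisioned_gb: float, used_gb: float, price_per_gb_month: float) -> float:
    """Portion of the monthly bill that pays for capacity nobody is using."""
    unused_gb = max(provisioned_gb - used_gb, 0.0)
    return unused_gb * price_per_gb_month


# Assumption: hypothetical price in USD per GB-month, for illustration only.
HYPOTHETICAL_PRICE = 0.10

PROVISIONED_GB = 10_000  # 10 TB provisioned
USED_GB = 1_000          # 1 TB actually used

if __name__ == "__main__":
    bill = monthly_storage_cost(PROVISIONED_GB, HYPOTHETICAL_PRICE)
    waste = wasted_spend(PROVISIONED_GB, USED_GB, HYPOTHETICAL_PRICE)
    print(f"monthly bill: {bill:.2f}, of which unused: {waste:.2f} ({waste / bill:.0%})")
```

Whatever the real price is, the proportion is the same: in this example nine-tenths of the storage bill pays for empty space.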

This can skyrocket your costs. Each container will have its starting resource requirements reserved, so overestimating how much you are likely to need can add a substantial amount to your bill over time.

Don’t trust cluster luck

To track this effectively, you can create a dashboard using the data in your containers, measuring cluster health along with more specific information on what is taking place within Namespaces and Pods. This allows you to look at the utilisation rates for clusters based on what is available to them and then work out how much those clusters actually use.

For example, you may have a large number of CPUs available for your containers. However, how many of those CPUs provisioned per container pod are being fully used over time? Depending on the utilisation you see in practice, you may be able to reduce the amount of compute you have in place. You can look at memory utilisation the same way.

Alongside this, another useful visualisation to build is the difference between requests and real utilisation for CPU and memory. If you find that your utilisation rates are low, you can look at other factors in your applications and edit the setup accordingly. Options might include running your nodes with more memory and less CPU. This should reduce your spending over time.
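The comparison between requests and real utilisation can be sketched in a few lines. The pod names and figures below are invented for the example; in practice they would come from the metrics gathered by your monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    cpu_requested: float  # cores reserved through resource requests
    cpu_used: float       # average cores actually consumed
    mem_requested: float  # GiB reserved through resource requests
    mem_used: float       # average GiB actually consumed


def utilisation(used: float, requested: float) -> float:
    """Fraction of the requested amount that is really used."""
    return used / requested if requested > 0 else 0.0


def request_gap(pods: list[PodUsage]) -> tuple[float, float]:
    """Return (unused CPU cores, unused memory in GiB) summed over all pods."""
    cpu_gap = sum(max(p.cpu_requested - p.cpu_used, 0.0) for p in pods)
    mem_gap = sum(max(p.mem_requested - p.mem_used, 0.0) for p in pods)
    return cpu_gap, mem_gap


# Hypothetical sample data, not real measurements.
PODS = [
    PodUsage("api", cpu_requested=4.0, cpu_used=1.0, mem_requested=8.0, mem_used=6.0),
    PodUsage("worker", cpu_requested=2.0, cpu_used=1.5, mem_requested=4.0, mem_used=3.0),
]

if __name__ == "__main__":
    cpu_util = utilisation(sum(p.cpu_used for p in PODS), sum(p.cpu_requested for p in PODS))
    mem_util = utilisation(sum(p.mem_used for p in PODS), sum(p.mem_requested for p in PODS))
    print(f"CPU utilisation {cpu_util:.0%}, memory utilisation {mem_util:.0%}")
    print("unused (cores, GiB):", request_gap(PODS))
```

With this sample data CPU is used far less heavily than memory, which is the kind of pattern that would point towards nodes with more memory and less CPU.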

After looking at this high-level data, you can examine what is going on at a Namespace level. For example, you might have a Namespace that requests many cores at the beginning of a transaction but doesn’t actually need them. Tuning each request to ensure each container gets the appropriate number of cores for its workload can reduce waste.
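One way to find candidates for that tuning is to total requests and usage per Namespace and flag those using only a small share of what they ask for. The sketch below assumes a 50% threshold and invented sample figures; both are illustrative choices rather than recommendations.

```python
from collections import defaultdict


def over_requested_namespaces(samples, threshold=0.5):
    """Flag namespaces whose CPU usage is below `threshold` of their requests.

    `samples` is an iterable of (namespace, cores_requested, cores_used).
    Returns (namespace, requested, used, ratio) tuples, largest unused
    core count first.
    """
    requested = defaultdict(float)
    used = defaultdict(float)
    for namespace, cores_requested, cores_used in samples:
        requested[namespace] += cores_requested
        used[namespace] += cores_used

    flagged = []
    for namespace, total_requested in requested.items():
        ratio = used[namespace] / total_requested if total_requested > 0 else 0.0
        if ratio < threshold:
            flagged.append((namespace, total_requested, used[namespace], ratio))

    # Sort by unused cores so the biggest potential saving comes first.
    return sorted(flagged, key=lambda row: row[1] - row[2], reverse=True)


# Hypothetical per-pod samples: (namespace, cores requested, cores used).
SAMPLES = [
    ("checkout", 8.0, 1.0),
    ("checkout", 4.0, 1.0),
    ("search", 2.0, 1.5),
    ("batch", 6.0, 2.0),
]

if __name__ == "__main__":
    for namespace, req, use, ratio in over_requested_namespaces(SAMPLES):
        print(f"{namespace}: requested {req} cores, used {use} ({ratio:.0%})")
```

Sorting by unused cores rather than by ratio puts the Namespaces with the largest absolute waste at the top, which is usually where tuning pays off first.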

What are the next steps?

Looking at container behaviour data is useful but is not enough on its own.

Instead, you should look at bringing your data together in one place over time. This will reveal when your application behaviour is different from what you have provisioned at the start and provide you with an opportunity to make your cloud services implementations more efficient.