Auto-tech series: eG Enterprise – Automation for modern monitoring

This is a guest post written by Rachel Berry in her role as technical product specialist at eG Innovations, a company known for its cloud-based application performance and IT-infrastructure monitoring solutions.

Berry’s work sees her centralise her professional engineering interests around an AIOps-powered single console monitoring solution, designed to monitor and automatically diagnose the root-cause of issues and in some cases auto-rectify problems for IT infrastructure and applications.

Speaking from her own experience working with modern automation technologies, platforms, tools and services, she suggests that as automation becomes ubiquitous in how applications and infrastructure are deployed and managed, products must continually improve their automation features, both to overcome some inherent challenges and to fit into IaC workflows where auto-scaling is the norm.

Berry writes as follows…

Automated monitoring

Automation manifests itself at so many levels of the modern IT stack, and monitoring is (I would like to politely propose) one of the disciplines best suited to automation, if not the best suited. Organisations now want monitoring that is plug-and-play, i.e. monitoring that deploys alongside new infrastructure and applications without manual intervention.

But automation to drive functions such as auto-scaling brings challenges.

On-premises systems with physical servers were (and still often are) static, so IT managers often configure monitoring manually in these scenarios, even setting metric alert thresholds by hand. In the modern world of pay-as-you-go and auto-scale, containers and virtual servers or machines can be spun up or down in seconds to service variable demand. This means monitoring vendors are investing a lot in [software] ‘agent’ technologies that ensure ephemeral resources are monitored as soon as they are created.

This often means products that can be directed at a cloud subscription account, a gateway or a delivery controller to discover the services and technologies in place and the way dependencies exist and are mapped out. Monitoring platforms now include APIs and configuration mechanisms so that monitoring can be set up using orchestration tools such as Terraform and Nerdio.

Technologies like OpenShift now have provisions for auto-deploying and configuring monitoring. The Red Hat universal operator and its associated certification program are a case in point here. The need to automate is also raising the need for agent technology to be universally deployable, something that would be complicated if the agent licensing were module-by-module (so you’d have to install a different module on each server). Organisations are looking for standardisation and security verification if they are asked to deploy ‘things’ that can sniff, crawl and discover components in their systems.

Once the monitoring is deployed, what is monitored and how it is monitored also need to be automated.

Monitoring the IaC tools

When IT becomes reliant on automation to roll out infrastructure and apps, the very tools the orchestration relies on need to be monitored, particularly if they are cloud-based services.

Monitoring platforms now support critical DevOps tools, cloud services and middleware such as GitHub services (YAML and other templates are often stored in GitHub for IaC workflows), Jenkins, Apache Kafka and Qpid, Ansible and so on; as well as orchestration technologies such as Kubernetes and Docker.

Out-of-the-box metric thresholding

Manually configuring metric thresholds and alerting, especially at scale, isn’t practical in an auto-scale world, so modern monitoring platforms leverage AIOps engines to set dynamic thresholds based on what they learn to be normal use.

This type of auto-baselining is now standard and enables automated anomaly detection.

However, there are significant challenges: without some sort of domain intelligence and some prior knowledge, monitoring tools can acquire a somewhat distorted view of what is ‘normal’. The load on a pre-production server can bear little relation to the load on one in production, and even mid-week public holidays can throw some systems.

Typically, some sort of mechanism combining both static and dynamic thresholds alongside some type of domain awareness is used to avoid alert storms based on crude measures such as X resource is being used 50% less than usual.
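As a rough illustration of how static and dynamic thresholds might be combined, here is a minimal Python sketch. It is not how any particular product works: the function name, the parameters and the use of a simple mean and standard deviation as the learned baseline are all assumptions made for this example. A static ceiling always alerts, a learned band catches unusual values, and a minimum absolute change stops a quiet system from alerting on tiny wobbles.

```python
from statistics import mean, stdev


def is_anomalous(history, value, static_max=None, k=3.0,
                 min_samples=5, min_abs_change=0.0):
    """Return True if `value` should raise an alert (illustrative only)."""
    # Static threshold: a hard ceiling that alerts whatever has been learned.
    if static_max is not None and value > static_max:
        return True
    # With too little history, a learned baseline cannot be trusted yet.
    if len(history) < min_samples:
        return False
    baseline = mean(history)
    spread = stdev(history)
    deviation = abs(value - baseline)
    # Dynamic threshold: outside k standard deviations of the baseline,
    # and large enough in absolute terms to be worth an alert.
    return deviation > k * spread and deviation >= min_abs_change


cpu_history = [40, 42, 41, 39, 43, 40, 41]  # hypothetical CPU % samples
print(is_anomalous(cpu_history, 42))                      # within the learned band
print(is_anomalous(cpu_history, 60))                      # outside the learned band
print(is_anomalous(cpu_history, 60, min_abs_change=25))   # suppressed by the floor
```

In practice the domain awareness mentioned above would decide values such as the static ceiling and the floor per technology, rather than leaving them as generic defaults.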

Configuration tracking

Real users often don’t raise issues immediately and help desk support calls can take time to reach ITOps teams. This is particularly true in sectors such as healthcare: a doctor isn’t going to leave a patient to call IT on finding that it’s slow to log on to patient records. Sometimes issues are intermittent and reported long after sessions or virtual servers have ceased to exist.

Configuration tracking and change tracking capabilities are becoming essential because monitoring data without the context of what hardware and infrastructure existed at the time of the incident is of limited use.

When configuration changes are automated, they can introduce issues into the environment without human involvement, so this information is vital for troubleshooting.
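To make the idea concrete, the sketch below shows one simple way a tool could answer the question "what did this system look like when the incident happened?". The ConfigHistory class, its method names and the snapshot fields are hypothetical and chosen only for illustration; it assumes snapshots are recorded in time order.

```python
from bisect import bisect_right


class ConfigHistory:
    """Time-ordered configuration snapshots for one system (illustrative)."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, timestamp, snapshot):
        # Assumption: snapshots are recorded in increasing timestamp order.
        self._times.append(timestamp)
        self._snapshots.append(dict(snapshot))

    def at(self, timestamp):
        """Return the configuration in force at `timestamp`, or None."""
        i = bisect_right(self._times, timestamp)
        return self._snapshots[i - 1] if i else None


history = ConfigHistory()
history.record(100, {"vcpus": 2, "image": "v1"})
history.record(200, {"vcpus": 4, "image": "v2"})  # an automated change

# A complaint arrives much later about slowness at time 150.
print(history.at(150))
```

Keeping the snapshots alongside the metrics means a late-reported incident can still be investigated after the virtual server itself has gone.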

Automated actions (resolve issues)

Enterprise monitoring platforms now have capabilities to perform remedial actions upon systems automatically when triggered to do so by alerts on metric threshold breaches, certain events or anomaly detection.

Domain-aware intelligence is becoming very important because, once an issue is automatically detected, the system also has to understand the best way to resolve it. Some remediation actions are very generic, e.g. if a service has failed, restart it. Most, however, can be very specific to the technologies used, e.g. cloud desktop administrators often want to detect idle users, disconnect them, drain hosts and shut down unnecessary servers that will run up costs. The timescales at which a user is considered idle may be shorter after 6pm compared to 9-5 office hours.
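The time-of-day point can be sketched in a few lines of Python. This is only an illustration of the rule described above, not a product feature: the function names, the 60 and 15 minute limits and the choice of 9am to 6pm on weekdays as office hours are all assumed values.

```python
from datetime import datetime, timedelta


def idle_limit(now):
    """Return how long a session may sit idle before it is disconnected."""
    # Assumed office hours: Monday to Friday, 09:00 up to 18:00.
    in_office_hours = now.weekday() < 5 and 9 <= now.hour < 18
    # Assumed limits: generous during the day, stricter out of hours.
    return timedelta(minutes=60) if in_office_hours else timedelta(minutes=15)


def should_disconnect(last_activity, now):
    """True if the session has been idle for at least the current limit."""
    return now - last_activity >= idle_limit(now)


# 30 minutes idle on a Wednesday afternoon, then 30 minutes idle that evening.
print(should_disconnect(datetime(2024, 1, 10, 13, 30), datetime(2024, 1, 10, 14, 0)))
print(should_disconnect(datetime(2024, 1, 10, 18, 30), datetime(2024, 1, 10, 19, 0)))
```

The same 30 minutes of inactivity leads to different decisions depending on when it happens, which is the kind of domain knowledge a generic "restart it" rule does not need.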

Traceable auditing & alerting of automation

In ITOps, the type of tasks that are automated can be highly intrusive and privileged, especially remediation operations such as rebooting servers, disconnecting sessions etc. In environments involving cloud, automation such as expanding and provisioning additional resources to meet demands can also come with hefty billing costs.

Once batch tasks, programmatic operations and service accounts become involved, the security and cost implications of being able to do things at scale without human sanity checks become a significant concern.

Many remediation strategies are also more about treating the symptoms of more subtle underlying root-cause issues. Restarting a JVM to clear memory leaks, rebooting a server, adding more CPU / RAM and deleting nearly full log disks can all hide a myriad of nasty things. Long-term, you don’t want these things going on unless you understand the root causes.

Beyond stringent user controls on who can perform actions on your systems, monitoring platforms should provide you with traceable audits that allow you to see both manual operator-instigated actions and programmatic operations run against your systems via any functionality in the tool.
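A minimal sketch of what such an audit trail could record is shown below. The run_audited helper, the field names and the in-memory list used as the log are assumptions for illustration; a real platform would write to durable, tamper-resistant storage.

```python
import json
from datetime import datetime, timezone


def run_audited(audit_log, actor, initiated_by, action, target, func):
    """Run `func` and append one audit record, whether it succeeds or fails."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                # user or service account name
        "initiated_by": initiated_by,  # "operator" or "automation"
        "action": action,
        "target": target,
    }
    try:
        func()
        entry["outcome"] = "success"
    except Exception as exc:
        entry["outcome"] = f"failed: {exc}"
        raise
    finally:
        # The record is written even when the action raises.
        audit_log.append(json.dumps(entry))


log = []
run_audited(log, "svc-remediate", "automation",
            "restart_service", "web-01", lambda: None)
print(log[0])
```

Recording who or what initiated each action is what lets a later review separate an operator's manual reboot from one triggered programmatically.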

Safety checks on automation

Most administrators now also monitor actions, service accounts and cloud billing accounts, with alerting in place to notify them of anomalous activity or unintended side-effects of automation such as large cloud costs.

It’s important to detect de-provisioning and report on de-provisioned systems.

Monitoring platforms face challenges operating in a world where things scale down. When automation de-provisions or blows away infrastructure such as a virtual server, the monitoring in place must differentiate between system failures and deliberate reconfiguration of the environment.
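One simple way to picture that distinction is to compare the moment a resource stopped reporting with any de-provisioning events the orchestration layer recorded. The sketch below is an assumption-laden illustration: the function name, the event dictionary and the five-minute grace window are hypothetical, and timestamps are plain seconds.

```python
def classify_disappearance(resource_id, last_seen, deprovision_events,
                           grace_seconds=300):
    """Decide whether a silent resource was removed on purpose or failed.

    `deprovision_events` maps resource ids to the time the orchestration
    tooling reported removing them (a hypothetical event feed).
    """
    removed_at = deprovision_events.get(resource_id)
    # Treat silence as deliberate only if a removal was recorded close to
    # the time the resource was last seen.
    if removed_at is not None and abs(removed_at - last_seen) <= grace_seconds:
        return "deprovisioned"
    return "suspected_failure"


events = {"vm-1": 1000}  # vm-1 was scaled in at t=1000
print(classify_disappearance("vm-1", 990, events))  # went quiet as it was removed
print(classify_disappearance("vm-2", 990, events))  # no removal on record
```

A resource that went quiet long before its recorded removal is still treated as a suspected failure here, since the removal does not explain the earlier silence.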

Reporting has become a lot more challenging as it now must cover data on the behaviour of systems that often no longer exist. Without configuration tracking data, many traditional reporting, capacity-planning and cost-estimation tools have become redundant.

Take-aways

Enterprise and commercial products mostly now have reasonable answers to the challenges that automation and parallel trends such as cloud usage bring. If you ask the right questions, you can find technologies that suit automated workflows.

For those choosing to build in-house or plumb together OSS monitoring technologies, automation is certainly making it harder.