PagerDuty CPO: Mature (automated) digital Ops reduces incident incidence

This is a guest post for the Computer Weekly Developer Network written by Sean Scott in his capacity as chief product officer (CPO) at PagerDuty – the company is known for its Real-Time Operations platform that integrates machine data & human intelligence to improve visibility & agility across organisations.

AI-powered automation has a crucial role to play in more defensive IT environments with an eye on risk, labour shortages, supply chain disruption and security – it helps by driving efficient event routing, securing processes, reducing labour-intensive manual tasks and improving security and the customer experience.

But to enable AI-powered automation, organisations need digitally mature operations: incident automation that joins up siloed security and IT departments while also improving organisational security and streamlining workflows… so how do we do this and get to a better future?

Scott writes in full as follows…

Automating incident management tasks, such as detection and prioritisation, reduces the time it takes to respond to and resolve incidents, cutting human error and ensuring greater accuracy.

Incident automation also frees up time for IT teams, reducing the time spent on manual tasks and the stress that comes with them, and therefore reducing employee burnout and turnover. It allows IT teams to focus on important innovation, which is particularly critical when enterprise budgets are tight.

The hands-on problem

Despite the value of automated incident management, manual tasks remain common in many organisations. A study conducted by Forrester found application deployment and delivery professionals are ‘still continuously overwhelmed with the number of manual tasks and volume of work’.

In our most recent State of Digital Operations Report, 14% of respondents classified their teams as “Manual” or “Reactive” and nine out of ten senior IT and development leaders decried current ITOps approaches, noting teams are spending nearly half of their time dealing with incidents rather than innovating, amounting to a financial hit of over $3 million per company per year.

Teams that work manually use a variety of monitoring tools, with high management overhead. Rule configurations must be maintained, increasing the effort needed to process event data, and data must be organised across a number of siloed systems.

Principally affected are the mean time to resolve (MTTR) and mean time between incidents (MTBI), because siloed incident response systems and manual processes burden digital ops teams. This leads to a vicious cycle of high workloads, employee churn and poor performance. Survey respondents in our State of Digital Operations Report detailed some of the qualitative impacts that increased turnover has had on them and their teams. Included in their responses were ‘more on-call shifts’, ‘higher workload’ and ‘increased MTTR’.
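
As a rough illustration of the two metrics named above, the sketch below computes MTTR and MTBI from a list of incident records. The `Incident` class and both function names are hypothetical, and MTBI is assumed here to mean the average gap between one incident being resolved and the next being opened; some teams measure it from start to start instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: when it was opened and when it was resolved."""
    opened: datetime
    resolved: datetime


def mean_time_to_resolve(incidents):
    """Average of (resolved - opened) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_incidents(incidents):
    """Average quiet period between one resolution and the next opening (assumed definition)."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.opened)
    gaps = [later.opened - earlier.resolved
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 10, 0)),
    Incident(datetime(2023, 1, 3, 9, 0), datetime(2023, 1, 3, 9, 30)),
    Incident(datetime(2023, 1, 6, 9, 0), datetime(2023, 1, 6, 10, 30)),
]
mttr = mean_time_to_resolve(incidents)
mtbi = mean_time_between_incidents(incidents)
```

A falling MTTR alongside a rising MTBI is the direction of travel a maturing team would hope to see.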

Employee burnout, which sets in when incident response teams have to respond to threats manually, is a major threat to the security of an organisation. According to VMware’s Global Incident Response Threats report, 69% of respondents experience burnout symptoms and have contemplated leaving their work, and we see it in our data. Increasing late-night notifications for on-call responders will lead to employees quitting.

Likewise, the PagerDuty report finds that respondents are being given less time to rest between on-call shifts and more job responsibilities. It also found overworked and burned-out responders are bearing the burden of off-hour interruptions across industries and across revenue segments.

The study also found organisations of all sizes and all industries are poor at responding to incidents, an issue compounded by manual processes. Alert noise is another constant in many organisations, distracting responders who could be spending their time more productively. This friction inevitably leads to an inferior customer experience, a weaker architecture and less time for innovation.

Covid-19 impact

The pandemic increased digitisation and made automation essential as organisations control and reduce costs, support remote workforces and work to increase supply chain resilience. The crisis also saw the advent of mass remote working, resulting in perimeterless networks and widespread adoption of personal devices, creating risk and complexity for IT teams.

Distributed workforces and freelance developers have also heightened security concerns. Forrester VP and research director Chris Gardner predicts that, in 2023, the growth of low-code and citizen development will lead to at least one headline security breach as many firms haven’t established governance policies around citizen development.

Post-Covid, investments in next-generation AI are crucial to bolster security and automate tasks. According to a UiPath study conducted by Forrester, nearly 50% of businesses globally will increase Robotic Process Automation (RPA) adoption due to Covid-19 pressures and will use RPA to support both the back office and remote workforces. In the post-Covid era of distributed workforces, incident automation solutions supporting collaboration and security have become critical.

The technology solution

Automating incident response processes enables efficiency, accuracy and faster response times, leading to better protection against evolving cyber threats and lower costs. According to the most recent IBM Cost of a Data Breach Study, organisations that have fully adopted security AI and automation save 65.2% in total breach costs.

That’s because automation expedites the security breach detection and response process, helping security teams remedy incidents. In the process, automation manages a variety of security tasks, including:

Threat investigation: AI-enabled tools can monitor the network for unusual behaviour and alert the security team to high-risk or suspicious activity.
Incident response: Security tools based on algorithms define how systems should respond to an event. Responses may include isolating a compromised device or deleting suspicious files.
Playbook creation: The security automation platform defines the workflows that the system will follow in a variety of security scenarios.
Endpoint protection: An endpoint protection platform (EPP) is a security tool that can automate device monitoring, as well as incident investigation and remediation.
Managing permissions: The platform can automate the provisioning and de-provisioning of accounts, as well as moderating requests for modifications or new permissions.
Reporting and compliance: The security automation platform manages routine logging and reporting activity, as well as flagging regulatory compliance issues.
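
To make the playbook and incident response items above more concrete, here is a minimal sketch of how a playbook might map event types to ordered response steps. Everything in it is an assumption made for illustration: the `SecurityEvent` class, the event kinds, the step names and the 1 to 5 severity scale do not come from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityEvent:
    """Hypothetical event emitted by a monitoring tool."""
    kind: str       # e.g. "malware_detected"
    severity: int   # assumed scale: 1 (low) to 5 (critical)
    device_id: str


# Hypothetical playbook: event kind -> ordered list of response steps.
PLAYBOOK = {
    "malware_detected": ["isolate_device", "delete_suspicious_file", "notify_security_team"],
    "unusual_login": ["require_reauthentication", "notify_security_team"],
}

# Fallback when no playbook entry exists: hand the event to a human.
DEFAULT_STEPS = ["notify_security_team"]


def plan_response(event):
    """Return the ordered response steps for an event, escalating severe ones."""
    steps = list(PLAYBOOK.get(event.kind, DEFAULT_STEPS))
    if event.severity >= 4:
        # High-severity events also page the on-call responder.
        steps.append("page_on_call")
    return steps


steps = plan_response(SecurityEvent("malware_detected", 5, "laptop-42"))
```

Keeping the playbook as plain data rather than code is one possible design choice: it lets security and IT teams review and change workflows without touching the routing logic.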

Machine learning deployed alongside automation is critical as it constantly learns by analysing data to identify suspicious software and information patterns, for example malware in encrypted traffic, questionable online resources or even insider threats, based on unusual behaviour. Cyber threats need ongoing analysis of millions of data points and the speed of machine learning enables automated responses to adjust in real-time to a dynamic threat landscape.

AIOps solutions supply intelligence, automate and centralise event orchestration and noise suppression and can help organisations achieve 44% fewer incidents. This saves time, enabling teams to focus on new products that give the business a competitive advantage. Intelligent automation technologies therefore make business sense. According to an analyst report by Futurum Research, more than half of organisations have adopted intelligent automation, as Robotic Process Automation (RPA) and AI become mainstream business solutions.

The solution is not just in the latest technology but also in a culture of investing in DevOps. The State of Digital Operations Report found investments in DevOps enable teams to accelerate their operational maturity growth. In turn, work hours become more consistent and burnout rates fall. This operational maturity correlates with a more even distribution of work amongst the team and a more efficient response to incidents.
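
Noise suppression, as mentioned above, can take many forms. One simple version is time-window deduplication, sketched below under stated assumptions: each alert is a `(timestamp, dedup_key)` pair, the function name `suppress_noise` and the five-minute window are hypothetical, and real AIOps products typically use richer grouping than this.

```python
from datetime import datetime, timedelta


def suppress_noise(alerts, window=timedelta(minutes=5)):
    """Keep only alerts that should notify a responder.

    alerts: iterable of (timestamp, dedup_key) tuples.
    An alert is suppressed if another alert with the same key arrived
    within `window` before it, so a continuous storm notifies only once.
    """
    last_seen = {}
    kept = []
    for timestamp, key in sorted(alerts):
        previous = last_seen.get(key)
        if previous is None or timestamp - previous > window:
            kept.append((timestamp, key))
        # Record every alert, kept or not, so the window slides with the storm.
        last_seen[key] = timestamp
    return kept


# Made-up sample data, for illustration only.
start = datetime(2023, 1, 1, 12, 0)
alerts = [
    (start, "db-cpu"),
    (start + timedelta(minutes=2), "db-cpu"),
    (start + timedelta(minutes=3), "api-5xx"),
    (start + timedelta(minutes=4), "db-cpu"),
    (start + timedelta(minutes=20), "db-cpu"),
]
notifications = suppress_noise(alerts)
```

With this sample data, five raw alerts collapse to three notifications: the first `db-cpu` alert, the `api-5xx` alert and the `db-cpu` alert that arrives after the quiet period.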

The future of auto-Ops

In a market where investment can fluctuate, we need continued focus and continued improvement in order to keep pace with growing organisational and systems complexity. Automation helps security teams adapt their response processes to respond more rapidly to the evolving threat landscape by providing real-time threat intelligence. This reduces response times and workload, while also improving the efficiency of incident response teams.

By automating incident triage and investigation, teams can quickly identify the cause of incidents. This means less time spent on manual tasks, less burnout and employee churn and more time spent on devising effective response strategies. Additionally, automated incident response workflows support consistency in the incident response process, reducing human error and supporting the customer experience.

Although new and automated processes can be exciting, it’s important to not neglect the human element. Expertise and judgement are valuable tools in the incident response toolkit and it is important to invest in DevOps and security training.

Post-Covid, with the right internal support structures, we can reach a perfect balance of human expertise and automation, the former being a key point of difference for customers.