
Why AI agent recoverability is vital for business resilience

While AI agents are transforming operations, they also bring new risks. Governance, continuous monitoring and instant rollback can help organisations preserve resilience, trust and safe innovation.

This is a guest blog post by Richard Cassidy, EMEA CISO, Rubrik

Artificial intelligence has entered a new phase. Organisations are no longer just experimenting with machine learning models or predictive analytics. They are beginning to deploy autonomous AI agents that can make decisions and carry out tasks at machine speed, acting almost instantaneously. While AI agents bring many business benefits, from faster operations to reduced manual effort, they also introduce a new category of risk for organisations.

Treating AI agents as employees

As enterprises roll out AI agents, business leaders must manage them as they would new employees. These agents may feel like background technology, but their decisions can dramatically shape outcomes, both positively and negatively. AI agents must therefore be onboarded with least privilege, granted only the minimum access and permissions necessary to perform their tasks. In the same way you would monitor human employees, they should also be continuously monitored and held accountable for their actions. Identity governance is an essential part of this picture, as it is the framework that defines how identities and their access rights are created and managed. However, governance alone is not enough. The reality is that mistakes will happen, no matter how carefully we apply controls. What matters is how quickly we can roll back to a safe state when those errors occur.
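As an illustration only, the sketch below shows what least-privilege onboarding of an agent identity might look like if expressed in code. Every name here (the AgentIdentity class, the scopes, the agent and owner) is hypothetical and does not refer to any specific vendor's product or API.

```python
# Hypothetical policy-as-code sketch: onboard an AI agent identity with
# least privilege. All names and scopes are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentIdentity:
    name: str
    owner: str                                      # accountable human, as with a new hire
    allowed_scopes: set[str] = field(default_factory=set)
    audit_log: list[dict] = field(default_factory=list)

    def request(self, scope: str, action: str) -> bool:
        """Check a requested action against granted scopes and record it."""
        permitted = scope in self.allowed_scopes
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "scope": scope,
            "action": action,
            "permitted": permitted,
        })
        return permitted


# Onboard the agent with only the minimum scopes it needs for its task.
invoice_agent = AgentIdentity(
    name="invoice-triage-agent",
    owner="finance-ops@example.com",
    allowed_scopes={"invoices:read", "invoices:flag"},
)

invoice_agent.request("invoices:read", "list unpaid invoices")   # permitted
invoice_agent.request("invoices:delete", "purge old records")    # denied, and logged
```

The point of the sketch is the pairing: the agent has a named human owner, a minimal set of permissions, and an audit trail of every request, granted or not.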

This is where conversations around recoverability become vital. If organisations cannot restore an AI agent to its previous state or undo the changes it made, then they are left exposed. The risk is not only operational downtime, but also a loss of trust in the technology itself.

Why current approaches fall short

Today, most tools on the market fall into two categories. On one side are monitoring tools that can tell you what an AI agent did. On the other are guardrail tools designed to prevent certain behaviours. Both approaches have value: monitoring provides visibility, while guardrails reduce the risk of accidents. But neither answers the most pressing question: what happens after the mistake?

If an AI agent deletes the wrong set of records, or pushes a faulty update through a workflow, knowing what happened is not enough. An organisation needs the ability to rewind to a clean state instantly.
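A minimal sketch of that "rewind" idea follows, assuming a simple snapshot taken before the agent acts. In practice, organisations would rely on immutable backups or point-in-time database recovery rather than in-memory copies; the class and data here are invented for illustration.

```python
# Illustrative sketch only: take a snapshot before an agent acts, then roll
# back if the action turns out to be a mistake. Real deployments would use
# immutable backups or point-in-time recovery, not in-memory copies.
import copy


class RecoverableStore:
    def __init__(self, records: dict):
        self.records = records
        self._snapshots: list[dict] = []

    def snapshot(self) -> int:
        """Capture the current state; returns a snapshot id."""
        self._snapshots.append(copy.deepcopy(self.records))
        return len(self._snapshots) - 1

    def rollback(self, snapshot_id: int) -> None:
        """Restore the store to a previously captured clean state."""
        self.records = copy.deepcopy(self._snapshots[snapshot_id])


store = RecoverableStore({"cust-001": {"status": "active"}})
clean = store.snapshot()

# The agent makes a destructive change it should not have made...
store.records.pop("cust-001")

# ...and the organisation rewinds to the clean state within moments.
store.rollback(clean)
assert "cust-001" in store.records
```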

Learning from the resilience playbook

Resilience has become a central principle in modern cybersecurity. Recent high-profile cyber-attacks in the UK have demonstrated that a focus on prevention alone is not enough; the ability to recover quickly and confidently is what counts. Many of the organisations affected are still dealing with the ongoing repercussions, operational disruption and financial losses caused by those breaches. Fast recovery is vital to minimise damage, and the same lesson now applies to agentic AI. While these systems deliver significant efficiency gains, they also introduce new forms of failure, and when they go wrong, the impact can escalate far faster than any human error. In this context, resilience must mean more than identity management or activity monitoring. It requires the assurance that, when disruption occurs, organisations can recover and continue doing business without losing trust or control.

Forensic insight plus rollback

To continue innovating safely with AI agents, organisations must have thorough investigation processes in place to understand what went wrong and why. That means having forensic-level insight into the agent's decision-making, not just visibility of the outcome. Only by understanding the reasoning behind an action can teams learn from it and prevent it from recurring.
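As a sketch of what forensic-level insight could mean in practice, the fragment below records not only what an agent did, but the inputs and reasoning it acted on, so a post-incident review can reconstruct why the action was taken. The record structure and example values are hypothetical.

```python
# Hypothetical decision-trace record: capture inputs, reasoning and outcome
# so investigators can reconstruct why an agent acted, not just what it did.
import json
from datetime import datetime, timezone


def record_decision(trace_file: str, agent: str, inputs: dict,
                    reasoning: str, action: str, outcome: str) -> None:
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "inputs": inputs,          # what the agent saw
        "reasoning": reasoning,    # why it chose the action
        "action": action,          # what it did
        "outcome": outcome,        # what happened as a result
    }
    with open(trace_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


record_decision(
    "agent_decisions.jsonl",
    agent="invoice-triage-agent",
    inputs={"invoice_id": "INV-1042", "amount": 950, "days_overdue": 45},
    reasoning="Amount over threshold and more than 30 days overdue",
    action="flagged for manual review",
    outcome="ticket FIN-221 created",
)
```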

Coupled with this is the need for instant rollback. If the agent makes a mistake, IT teams should be able to revert to a clean state within minutes. The combination of forensic insight and fast recovery is what gives organisations the confidence to use AI at scale without fear of disruption.

Preparing for non-human error

Organisational resilience has traditionally focused on human error, malicious attacks and system outages. Agentic AI introduces an entirely new category: non-human error.

Organisations that prepare now, with governance and recoverability in place, will be able to innovate confidently. Those that do not may find themselves unable to trust the very tools they hoped would transform their operations.

Resilience is not just about defence. It is about ensuring continuity, maintaining trust and creating the conditions for a strong bounce back. As enterprises adopt agentic AI, recoverability must become a central part of that strategy. While AI agents offer significant operational potential, they also introduce systemic risk if not governed and controlled effectively.

Richard Cassidy is EMEA CISO at Rubrik.
