Self-service tools - PagerDuty: The shift from ‘we deploy’ to ‘you deploy, we observe’
This is a guest post for the Computer Weekly Developer Network written by Mandi Walls, DevOps advocate at PagerDuty.
Walls, in her inimitable style, writes in full as follows…
Organisations are adopting developer self-service platforms and, as with every iteration of the development model, there are teething problems where the new tools rub up against traditional methods, management and cultural practices.
The aim for all developer departments should be to find a happy situation where developers spend their day in development and deployment, while platform teams observe and respond.
So, with the rise of self-service developer platforms transforming how software is built and operated, developers expect – or hope – to deploy and manage services themselves without waiting on Ops teams. But such unfettered autonomy introduces new challenges that massively concern business leadership, including reliability, governance and incident management. These issues aren’t magically fixed as they move closer to the developer – who isn’t interested in looking after them at the expense of their main role.
Teams on the journey to full self-service development models must carefully navigate the transition from the ops-driven ‘we deploy’ to a developer-led ‘you deploy, we observe’. From my perspective at PagerDuty, watching businesses of all sizes and types make the transition, platform and reliability engineering must evolve together to empower developers while ensuring uptime and trust at all costs.
Let’s clear up some misconceptions
Some voices in the industry claim platform engineering adds unnecessary complexity or re-centralises control. Not so. The real issue is balance.
Developers shouldn’t be shielded from operational context and deployment environments. Understanding production realities leads to better, more resilient software. Think of it like designing a car without ever seeing a road. That would make it hard to anticipate performance, wear, or real-life safety. Platform engineering done right enables actual productive autonomy, not ignorance. A productive and resilient team needs knowledge of production environments in order to make use of all the available resources and features they offer.
Reliability must be designed in, not added on later.
This is because the goal of self-service shouldn’t just be faster delivery; it should be sustainable delivery. Moreover, reliability is a team sport: Developers own their code in production. Platform engineers provide the tools, guardrails and telemetry. Reliability engineers ensure the whole system meets SLOs and SLAs. Without embedded reliability practices, self-service can quickly lead to operational overload and alert fatigue. The point is that true self-service includes observability, alerting and incident workflows from day one.
But when it comes to autonomy, guardrails make us safe: automated readiness checks before deployment, for example, covering dependency health, recent incidents and service ownership metadata, so every service knows who’s on call. This isn’t a challenge to implement; pre-defined escalation policies and playbooks can be built into platforms, with error budgets and SLO tracking visible in dashboards for all to see. Such built-in ‘rails’ prevent surprises while keeping developers in control of release cadence.
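To make that concrete, here is a minimal sketch of what such a readiness gate might look like. It is illustrative only: the metadata fields (owner_team, open_incidents, error_budget_pct, dependency_health) are assumptions for the example, not any particular platform’s API.

```python
# Illustrative pre-deploy readiness gate (hypothetical metadata fields, not a
# PagerDuty API): block a release if ownership, incident or error-budget
# checks fail.
from dataclasses import dataclass


@dataclass
class ReadinessReport:
    checks: dict[str, bool]

    @property
    def ready(self) -> bool:
        return all(self.checks.values())


def readiness_check(service: dict) -> ReadinessReport:
    """Run the guardrail checks described above against service metadata."""
    checks = {
        # Every service must declare an owning team with an on-call rotation.
        "has_on_call_owner": bool(service.get("owner_team")),
        # Hold the release if an incident is currently open on this service.
        "no_open_incidents": service.get("open_incidents", 0) == 0,
        # Respect the error budget tracked against the service's SLO.
        "error_budget_remaining": service.get("error_budget_pct", 0) > 0,
        # Dependencies reported healthy by whatever health source you use.
        "dependencies_healthy": all(service.get("dependency_health", {}).values()),
    }
    return ReadinessReport(checks)


if __name__ == "__main__":
    service = {
        "owner_team": "payments-oncall",
        "open_incidents": 0,
        "error_budget_pct": 42.0,
        "dependency_health": {"postgres": True, "auth-api": True},
    }
    report = readiness_check(service)
    print("deploy allowed" if report.ready else f"blocked: {report.checks}")
```

The point of a gate like this is not to take the release decision away from developers, but to make the operational context visible at the moment they press the button.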
Platform engineering is an evolution from the golden paths established for cloud native environments, encouraging good practices while allowing application developers to experiment with new technologies. Thus, we go back to platform teams acting as enablers.
Learn & return insights
When it comes to resilience, every frustrating incident teaches something about the tech environment and human processes.
Here’s our hot take at PagerDuty: post-incident learning is the real feedback loop for self-service improvement.
You incorporate learnings through simple techniques, applied regularly and championed widely: update runbooks and automation templates; adjust alert thresholds based on previous false positives; refine escalation rules and ownership mapping; embed knowledge into policies and workflow automation tools. Ideally, you end up with a self-service platform that learns from the organisation’s collective experience.
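As a small, hedged illustration of the threshold-tuning point, the sketch below nudges a latency alert threshold upward when too many recent alerts were marked as false positives in review. The alert records and the tune_threshold helper are hypothetical, chosen only to show the feedback loop.

```python
# Hypothetical sketch of 'adjust alert thresholds based on previous false
# positives': raise a noisy latency threshold by a small, reviewable step.

def tune_threshold(current_ms: float, recent_alerts: list[dict],
                   max_false_positive_rate: float = 0.2,
                   step_pct: float = 0.1) -> float:
    """Return a new threshold; field names here are illustrative."""
    if not recent_alerts:
        return current_ms
    false_positives = sum(1 for a in recent_alerts if a.get("false_positive"))
    rate = false_positives / len(recent_alerts)
    if rate > max_false_positive_rate:
        # Too noisy: widen the threshold slightly rather than ignoring alerts.
        return round(current_ms * (1 + step_pct), 2)
    return current_ms


# Example: 3 of the last 10 alerts were false positives, so 500ms becomes 550ms.
history = [{"false_positive": i < 3} for i in range(10)]
print(tune_threshold(500.0, history))  # -> 550.0
```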
Self-service doesn’t mean ‘everyone for themselves’ – let’s remember that.
Shared responsibility is backed by a strong, clear process; governance and readiness can coexist with developer freedom… but that coexistence requires:
- Clearly defined ownership – who’s on call, who maintains what, layer-by-layer.
- Automated verification and deployment pipelines with embedded checks.
- Continuous improvement moments – blameless postmortems, reliability retrospectives and the like, shared across all teams.
Developers gain confidence knowing the platform enforces reliability by design, and the organisation benefits through continuous uptime and faster recovery when inevitable issues occur.
But really, what about in practice?
In the new model, ops and platform teams shift focus from performing deployments to observing and supporting them. This requires a few key functions:
- Centralised observability dashboards for unified situational awareness.
- Automated alert routing based on service ownership (see the sketch after this list).
- Real-time incident visibility and response orchestration.
- Coordinated, organisation-wide best practices for alerts, log messages and telemetry to enhance observability across services.
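To illustrate the ownership-based routing item above, here is a minimal, hypothetical sketch. The service catalogue and escalation policy names are assumptions for the example, not a real PagerDuty integration.

```python
# Minimal sketch of ownership-based alert routing: look the service up in a
# catalogue and attach its owning team and escalation policy to the alert.

SERVICE_CATALOGUE = {
    "checkout-api": {"team": "payments", "escalation_policy": "payments-primary"},
    "search-index": {"team": "discovery", "escalation_policy": "discovery-primary"},
}


def route_alert(alert: dict) -> dict:
    """Return the alert enriched with ownership and escalation metadata."""
    owner = SERVICE_CATALOGUE.get(alert["service"])
    if owner is None:
        # Unowned services fall back to the platform team rather than silence.
        owner = {"team": "platform", "escalation_policy": "platform-catch-all"}
    return {**alert, **owner}


print(route_alert({"service": "checkout-api", "summary": "p99 latency breach"}))
# -> the alert carries the payments team and its primary escalation policy
```

The design choice matters more than the code: ownership data lives in one catalogue, so routing, dashboards and postmortems all agree on who is accountable for a service.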
If you really want to empower developers to own their services without isolating them, then collaboration must be built into observability.
The reliability mantra
Keep saying it as you practice – ‘reliability is everyone’s job’.
As developer self-service becomes the norm, reliability can no longer live only in an ops silo.
The most successful organisations will be those that make operational readiness and continuous uptime part of the developer experience itself. Consider this the mantra for the next generation of teams: “You deploy, we observe and we do it together.”

