Elastic developer advocate: Self-service tools & the end of blind production ownership
This is a guest post for the Computer Weekly Developer Network written by Carly Richmond, developer advocate lead at Elastic.
Elastic provides a search-powered platform for enterprise search, observability, and cybersecurity, helping organisations analyse data and secure systems in real-time at scale.
Richmond writes in full as follows…
The shift towards self-service developer platforms has reshaped how organisations build and operate software. For decades, production environments enforced a boundary between the developers who wrote software and the operational teams responsible for deploying and supporting it.
How it used to be…
Developers wrote code, requested infrastructure and handed off to centralised ops. These patterns persist today, making production someone else’s responsibility, limiting how quickly developers can fix problems and reducing their accountability.
Self-service introduces a new assumption: developers now provision infrastructure, deploy to production using automated tools such as Terraform and Ansible, and investigate failures and poor performance without waiting on cross-team handoffs. That autonomy removes bottlenecks. But autonomy alone does not equip developers to own production applications. Developers need visibility into how their applications behave inside modern distributed systems.
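As a rough sketch of what that self-service step can look like, the snippet below wraps the Terraform CLI in a small pipeline script so a team can plan and apply its own infrastructure changes. The script, directory layout and review step are illustrative assumptions, not a prescribed workflow.

```python
# Illustrative only: a minimal self-service deploy step that wraps the
# Terraform CLI, assuming Terraform is installed and the working directory
# contains the team's infrastructure definitions.
import subprocess
import sys

def run(cmd: list[str]) -> None:
    """Run a command and stop the pipeline if it fails."""
    print(f"$ {' '.join(cmd)}")
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)

if __name__ == "__main__":
    run(["terraform", "init", "-input=false"])
    run(["terraform", "plan", "-input=false", "-out=tfplan"])
    # In practice the saved plan would be reviewed or policy-checked before apply.
    run(["terraform", "apply", "-input=false", "tfplan"])
```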
Ownership of production is about understanding failures across services, dependencies, infrastructure and data tiers as much as deployment speed.
Incidents create bottlenecks across investigation, retention and risk scoping. To investigate effectively, developers must find relevant signals within dense telemetry. Data must be accessible and access must be reliably governed so that mistakes do not compromise investigative clarity. There are also regulatory and audit commitments to consider.
Manual ownership fails at scale
Engineering leaders lean on DevOps Research & Assessment (DORA) metrics because they quantify a truth developers feel: process friction increases development time and feedback latency undermines system reliability.
Elite engineering teams deploy multiple times per day, using infrastructure-as-code practices to validate infrastructure and software changes against a comparable environment. As a result, these teams experience fewer failures that impact users, gain testing confidence because changes are validated against equivalent environment specifications and restore service faster than ticket-led handoffs allow.
Ticketing or request-based deployment processes delay communication, and the back-and-forth can increase the lead time for changes.
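To make two of those DORA measures concrete, here is a minimal sketch of computing lead time for changes and deployment frequency from deployment records. The record structure and timestamps are invented for illustration.

```python
# A minimal sketch of two DORA metrics computed from deployment records.
# The record fields (commit_time, deploy_time) and values are hypothetical.
from datetime import datetime, timedelta
from statistics import median

deployments = [
    {"commit_time": datetime(2024, 5, 1, 9, 0), "deploy_time": datetime(2024, 5, 1, 11, 30)},
    {"commit_time": datetime(2024, 5, 1, 14, 0), "deploy_time": datetime(2024, 5, 2, 10, 0)},
    {"commit_time": datetime(2024, 5, 3, 8, 0), "deploy_time": datetime(2024, 5, 3, 9, 15)},
]

# Lead time for changes: how long a change waits between commit and production.
lead_times = [d["deploy_time"] - d["commit_time"] for d in deployments]
print("Median lead time:", median(lead_times))

# Deployment frequency: deployments per day over the observed window.
window = max(d["deploy_time"] for d in deployments) - min(d["deploy_time"] for d in deployments)
per_day = len(deployments) / max(window / timedelta(days=1), 1)
print(f"Deployments per day: {per_day:.2f}")
```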
Beware, failure blast radius
Manual handoffs slow deployments, introducing queue delays and stretching feedback loops across teams, tooling and organisational boundaries. When telemetry is not available, defects are found late, debugging context is fragmented and the failure blast radius widens before anyone can act.
Silent failures demonstrate this compounding cost. Modern cloud services fail in edge-state transitions that dashboards do not expose, in services that are not fully instrumented, or in runtime conditions that metric aggregates reduce to nominal values. Metrics remain essential for modelling system thresholds – CPU, memory, latency, or error rates – but these aggregates rarely capture the root cause.
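A tiny illustration of how aggregates hide problems: with invented latency figures, the mean can look nominal while the tail shows a slice of users suffering badly.

```python
# Illustration with invented numbers: an average can report a 'nominal' value
# while one request in a hundred is hitting a failing dependency.
from statistics import mean, quantiles

latencies_ms = [40] * 99 + [900]  # one degraded request per hundred

print(f"mean: {mean(latencies_ms):.0f} ms")                   # ~49 ms - looks healthy
print(f"p99:  {quantiles(latencies_ms, n=100)[98]:.0f} ms")   # exposes the degraded tail
```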
Traces help map how requests move through service paths, but tracing assumes thorough instrumentation of every component in your ecosystem. Gaps persist when instrumentation discipline is left to every team shipping a microservice, or when teams rely on the basic signals automatic instrumentation provides and miss event context.
Some components, such as frontend services, rely instead on Real User Monitoring (RUM) agents.
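To show the kind of event context automatic instrumentation tends to miss, here is a minimal OpenTelemetry tracing sketch that adds business attributes to a manual span. The service, span and attribute names are illustrative.

```python
# A minimal OpenTelemetry sketch: a manual span carrying the business context
# (order ID, discount code) that auto-instrumentation alone tends to miss.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def apply_discount(order_id: str, code: str) -> None:
    with tracer.start_as_current_span("apply_discount") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("discount.code", code)
        # ... business logic; exceptions raised here are recorded on the span

apply_discount("ord-1234", "SPRING10")
```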
When traces offer assumptions rather than guarantees, engineers pivot to logs for answers. Logs carry substantial investigative weight: they capture runtime errors, state changes and unfiltered production context that metrics strip out.
Correlated telemetry at each layer in the software ecosystem helps us diagnose issues in modern applications.

Log parsing pipelines require complex maintenance. Teams have manually tuned regular expressions, normalised schemas via brittle ETL and pruned logs under cost pressure, often discovering missing signals only after incidents had compounded user impact. The cost has been toil, fragmented visibility and elevated error risk.
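The fragility is easy to picture. The sketch below shows the kind of hand-tuned regex parsing described above, against an invented log format; the moment a service changes its message layout, lines silently stop matching.

```python
# The kind of hand-tuned parsing described above: a regex tied to one log
# format (invented here), which silently drops lines when the layout changes.
import re

LINE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) \[(?P<service>[^\]]+)\] (?P<message>.*)$"
)

def parse(line: str) -> dict | None:
    match = LINE.match(line)
    return match.groupdict() if match else None  # unmatched lines simply vanish

print(parse("2024-05-01T10:02:31Z ERROR [payments] card authorisation timed out"))
print(parse('{"ts": "2024-05-01T10:02:31Z", "level": "ERROR"}'))  # -> None
```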
The telemetry foundation developers need
When investigating production issues, developers rely on three types of telemetry signals: logs, metrics and traces. The challenge is to retain data in a way that is accessible, relevant and interpretable. Organise logs into logical fields that developers can validate, modify, or approve, using conventions such as the OpenTelemetry Semantic Conventions.
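As an example of what those logical fields can look like, the sketch below emits one structured log event whose field names are drawn from the OpenTelemetry Semantic Conventions (service.name, http.request.method, http.response.status_code, error.type); the event itself and the remaining fields are invented.

```python
# A sketch of a structured log record using attribute names drawn from the
# OpenTelemetry Semantic Conventions. The event and values are invented.
import json
from datetime import datetime, timezone

record = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "log.level": "error",
    "message": "card authorisation timed out",
    "service.name": "payments",
    "http.request.method": "POST",
    "http.response.status_code": 504,
    "error.type": "GatewayTimeout",
}

print(json.dumps(record))  # one line per event, with fields the team can validate and evolve
```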
Richmond: Self-service tools have transformed delivery, but developer autonomy works only when observability provides the context needed to understand system behaviour directly.
Investigations rarely involve a single event. Developers need correlation context that connects failures across services, requests, or runtime anomalies. Standards such as OpenTelemetry provide a structured way to link logs, metrics and traces, including context that maps failures to requests, deployments, or subsystem state in a vendor-agnostic way.
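A minimal sketch of that linkage: stamp the active trace and span IDs onto each structured log record so an investigation can pivot between logs and traces. It assumes a tracer is already configured, as in the earlier tracing sketch, and the helper name is hypothetical.

```python
# Log/trace correlation sketch: attach the active trace and span IDs to each
# structured log record. Assumes a tracer configured as in the earlier sketch.
import json
from opentelemetry import trace

def log_event(level: str, message: str, **fields) -> None:
    ctx = trace.get_current_span().get_span_context()
    record = {"log.level": level, "message": message, **fields}
    if ctx.is_valid:
        record["trace_id"] = format(ctx.trace_id, "032x")
        record["span_id"] = format(ctx.span_id, "016x")
    print(json.dumps(record))

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("apply_discount"):
    log_event("error", "discount service unreachable", retries=3)
```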
During incidents, investigation speed matters as much as the data. Highlighting significant events while suppressing ‘noise’ allows developers to focus on the root cause rather than sifting fragmented logs. Metrics and traces provide supporting context, but logs often form the investigative narrative. Correlating OpenTelemetry signal context, prioritising important events and retaining logs gives developers what they need to diagnose issues when systems struggle.
Guardrails for autonomy
But visibility alone is not enough. Without governance, even the most detailed telemetry can become a source of risk. Even before we get to investigating production logs, including fields such as personally identifiable information (PII) and other sensitive data in those logs can lead to regulatory breaches. One of the most destructive observability failure modes is not a code defect, but the accidental loss or mismanagement of telemetry data. Logs ‘pruned’ in error can remove the context needed to debug the next incident.
Guardrails can bound production access without reverting to rigid ticket queues. Role-based access control (RBAC), data segmentation and document and index-level entitlements allow developers to investigate production telemetry safely, limiting the risk of exposure to restricted data, accidental deletions or unauthorised changes. Document-level entitlements are useful where telemetry tools are provided as a shared service. The Principle of Least Privilege ensures that developers access only the telemetry they need.
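As one way to picture such an entitlement, the sketch below defines a read-only role with the Elasticsearch Python client's security API, using a document-level query so the role only ever sees one team's telemetry. The role name, index patterns, connection details and the service filter are assumptions for illustration.

```python
# A sketch of scoped telemetry access via the Elasticsearch Python client's
# security API. Role name, index patterns and the document-level query are
# illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # placeholder connection details

es.security.put_role(
    name="payments-telemetry-reader",
    indices=[
        {
            "names": ["logs-*", "traces-*"],
            "privileges": ["read", "view_index_metadata"],
            # Document-level security: the role only sees its own service's data.
            "query": {"term": {"service.name": "payments"}},
        }
    ],
)
```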
In brief, production ownership works effectively only when telemetry is preserved, actionable signals are surfaced quickly and permissions prevent system-wide impacts. Guardrails protect critical data while supporting busy developers.
Enabling safe autonomy in software delivery
Self-service tools have transformed delivery, but developer autonomy works only when observability provides the context needed to understand system behaviour directly. Logs, metrics and traces carry the investigative narrative, threshold indicators and request paths engineers rely on, but signal correlation and search make this information actionable.
Searching by key terms and message patterns (such as ERROR text) can help developers quickly identify the log messages that explain root causes. RBAC and data-level entitlements are guardrails that eliminate noise from other applications, while automation reduces the need for risky manual retention or parsing decisions.
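A minimal search along those lines might look like the sketch below: pull recent error-level log lines mentioning a key term via the Elasticsearch Python client. The index pattern, field names and connection details are assumptions for illustration.

```python
# A minimal search sketch: recent error-level log lines mentioning a key term.
# Index pattern, field names and connection details are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # placeholder connection details

response = es.search(
    index="logs-*",
    query={
        "bool": {
            "filter": [
                {"term": {"log.level": "error"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ],
            "must": [{"match": {"message": "timeout"}}],
        }
    },
    size=20,
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])
```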
The teams that move fastest are those with full context, relevance-focused search and appropriately scoped access. Production ownership is safe and effective when modern observability, structured correlation, precise entitlements and automation work together. Teams that combine these elements are faster, safer and more resilient, and they empower developers to act decisively.
If you’re managing the DevSecOps process well, self-service, implemented with consideration, simply supercharges great teams.
