Platform Engineering - Sidero Labs: Central IT, now with YAML!

This is a guest post by Justin Garrison, head of product at Sidero Labs – the company specialises in Kubernetes infrastructure automation, developing tools including Talos Linux, a security-focused operating system designed specifically for Kubernetes deployments, and Sidero Omni, a SaaS-for-Kubernetes solution.

Garrison writes as follows… 

Many platform engineering teams are building the wrong thing for the wrong people, and they’re building it on infrastructure that all but guarantees failure.

The fact is that most platform engineering initiatives aren’t actually about helping developers. They’re about making life easier for security teams, CFOs and infrastructure managers who need centralised reporting and control. The platform team might think they’re building decentralised developer frameworks, but they’re really building central IT reporting with a Kubernetes logo slapped on top.

The kicker is that they’re ignoring critical parts of the infrastructure that could save them a ton of time and solve problems for developers and stakeholders alike.

Who’s paying for the platform?

If you’re on a platform team, ask yourself who is funding your work. Look at your backlog and ask where your paycheck comes from. If the answer isn’t “development teams,” then developers aren’t your customers, no matter what your mission statement might say.

I constantly see platform teams funded by security departments building “golden paths” that are really non-optional guardrails.

Or teams funded by infrastructure groups building “developer productivity tools” that are actually cost control systems. Or teams funded by the CFO building “acceleration platforms” that are fundamentally about standardisation and vendor consolidation.

While none of these are inherently bad goals, we should stop pretending they’re the same thing as developer enablement. Even when platform teams genuinely want to help developers, they don’t have the authority to say “no.” They have to build a single foundation for every possible use case.

Homer’s car problem IRL

Platform teams love to talk about reducing cognitive load and creating consistent experiences.

Then they build platforms that try to be everything to everyone. Like Homer Simpson’s car, nobody asked for it and nobody can use it. (The clip from the show’s heyday, with guest voice Danny DeVito, is worth your four minutes.)

In the database world, the industry realised that one size doesn’t fit all and built SQL for transactions, NoSQL for scale, time series for metrics, graph DBs for relationships, etc. Each was specialised for specific use cases and each was better at its own thing than any one-size-fits-all solution ever could be. There’s no bigger mistake in databases than using the wrong data model for the use case in front of you: you’ll create more work for yourself, you’ll limit your ability to scale and you’ll end up trying to swap the wheels on a moving car.

Platform engineering doesn’t have that luxury: the single internal developer platform (IDP) will somehow need to meet the needs of your website tier, your backend team, databases, the PCI environment and that brand-new AI research team. It ends up being central IT pretending to be modern, which hasn’t worked before and won’t work now.

Let’s not forget that platform teams are overhead departments. They don’t make the company money, so they’re given a shoestring budget and a handful of ex-DevOps engineers. The platform isn’t treated as a product and will never be given the respect it deserves, given the impact it’s expected to deliver.

A cloud frontend is no platform

Even the best-designed platform struggles when it’s built on top of systems that were never meant to be platforms. Many teams aren’t being sabotaged by their infrastructure; they’re just stuck automating what should have been abstracted.

Instead of choosing infrastructure designed for their use case, teams layer scripts and cloud-init files over general-purpose Linux in an effort to bend it towards their intended use. They’ve defined “platform” as a limited version of AWS for internal use only. Essentially, you’ve got a collection of resources with an AI-generated frontend, held together by YAML, bash and, more often than not, hope. In the words of the Google SRE book: “Hope is not a strategy.”
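To make that concrete, here is a rough sketch of the kind of bootstrap glue that tends to get baked into cloud-init user-data on a general-purpose distro. Every hostname, URL and script path below is hypothetical; the point is that each line is another hand-assembled layer, and each layer is a place where environments can quietly drift apart.

  #!/usr/bin/env bash
  # Hypothetical node bootstrap script of the sort often shipped via cloud-init user-data.
  set -euo pipefail

  # Package versions float with whatever the mirror serves on the day the node boots.
  apt-get update && apt-get install -y containerd curl

  # Config fetched over the network at boot time (placeholder internal URL).
  curl -fsSL https://configs.example.internal/kubelet.conf -o /etc/kubernetes/kubelet.conf

  # In-place edits to a mutable filesystem, slightly different on every node that boots at a different time.
  sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
  systemctl restart systemd-journald containerd

  # Yet another script underneath the script (also a placeholder).
  /opt/scripts/join-cluster.sh "$(hostname)"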

I watched one team spend six months building a beautiful developer portal. Self-service everything, GitOps workflows, full test automation. Then they started getting weird bugs: same containers and same manifests, but completely different behaviour across environments.

As it turned out, their “immutable” infrastructure was running on Ubuntu servers that had been patched on different schedules. Half their nodes were running different kernel versions. The platform looked perfect, but the foundation was brittle, hand-assembled and out of sync.
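For what it’s worth, that kind of drift is easy to surface once you go looking for it. A generic check (not specific to the team in this story) is to ask the Kubernetes API what each node reports about its own OS and kernel:

  # The wide output includes OS-IMAGE and KERNEL-VERSION columns for every node.
  kubectl get nodes -o wide

  # Count how many distinct kernel versions are actually running in the fleet.
  kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' | sort | uniq -c

If the second command prints more than one line, the “immutable” fleet has already diverged.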

API-driven everything

Most platform teams aren’t failing because they lack automation, but because they have too much automation on top of the wrong foundation. Layers of tooling end up getting stacked on top of general-purpose infrastructure in an attempt to force consistency and control, but the real unlock isn’t more automation, it’s removing the layers that don’t belong. Fewer moving parts mean faster feedback loops and fewer surprises. Less software means less to patch, less to scan and less that can wake you up in the middle of the night. Instead of writing scripts to manage complexity, you eliminate that complexity entirely.

When SNCF, France’s national railway, modernised its infrastructure, it started with the conventional approach: Ubuntu, standard tooling and well-intentioned platform engineering. That approach lasted a year. The problem wasn’t effort, but excess: there were just too many layers doing too many things that weren’t actually needed.

What changed everything was switching to an OS that matched the platform’s operating model. With Talos Linux, every operation (from provisioning to patching to policy enforcement) became an API call. The team wasn’t just adding automation; it was working with something built to be controlled. Production incidents dropped by 90%, not because the team changed its workflows, but because the infrastructure underneath no longer fought against the platform’s goals.
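As a rough sketch of what that API-first model looks like in practice (the node IPs, file names and image tag below are placeholders, not SNCF’s actual configuration), day-two operations collapse into a handful of calls against the Talos API:

  # Apply a declarative machine configuration to a node.
  talosctl apply-config --nodes 10.5.0.2 --file controlplane.yaml

  # Upgrade the operating system on that node to a specific installer image,
  # as an API call rather than an SSH session and a package manager run.
  talosctl upgrade --nodes 10.5.0.2 --image ghcr.io/siderolabs/installer:v1.8.0

  # Confirm every node in the fleet reports the same version after the rollout.
  talosctl version --nodes 10.5.0.2,10.5.0.3,10.5.0.4

Because each of these is an authenticated API request rather than a login session, the same calls work the same way whether the node is a cloud VM, a bare-metal box or an edge device.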

The portability paradox

Most platform engineering setups are prisoners of their own infrastructure choices. They work great in one specific environment until business reality hits and suddenly you need edge deployments, or multi-cloud for compliance, or different regions for latency. Because platforms are so tightly coupled to their specific infrastructure, “portability” equates to “rebuild everything from scratch somewhere else.”

Real platform portability isn’t about containers. Containers, I’d argue, were never truly the hard part. It’s about everything underneath behaving identically regardless of where it’s running (the same APIs, same security model, same operational characteristics, etc.), whether bare metal, cloud VMs, or edge devices with unpredictable constraints.

A little more context on the edge: while not every team needs to run at the edge, the ones that do cannot rely on infrastructure designed around datacentre assumptions. Edge environments are inherently limited, with often-unstable networks, constrained compute and inconsistent power. The edge can and will amplify weak points in a platform’s design. You can’t rely on shell scripts and maintenance windows when your node is in a shipping container or on a satellite with a 30-minute SSH window. That’s all the more reason why consistency in your infrastructure layer matters. Not everything needs to work the same everywhere, but platforms do need to be intentionally designed for the environments they serve.

We need to stop building platforms for the wrong people. I’m not against platform engineering, but if your platform is funded by anyone other than development teams, you’re building management tools, not dev tools. (That’s fine, just call it what it is.)

If you are building for developers, stop building on infrastructure that only holds together thanks to layers and layers of automation. A platform is only as reliable as its foundations, which means being API-driven from the ground up, portable by design and strong enough to handle the edge.

My two cents is that the companies getting platform engineering right aren’t the ones with the fanciest UIs or the most sophisticated abstractions, but the ones that figured out who they’re really building for, what problems they’re solving and what infrastructure foundations make those solutions possible. Everything else is just central IT with better marketing.