Modern development - Bloomberg: Of monoliths and meshes

This series is devoted to examining the leading trends that go towards defining the shape of modern software application development.

As we have initially discussed here, with so many new platform-level changes now playing out across the technology landscape, how should we think about the cloud-native, open-compliant, mobile-first, Agile-enriched, AI-fuelled, bot-filled world of coding and how do these forces now come together to create the new world of modern programming?

This contribution comes from Peter Wainwright in his role as senior engineer and author with the Developer Experience (DevX) team at Bloomberg, which develops the tools and processes the company’s 6,000+ software engineers use to manage its codebase.

The Bloomberg Terminal delivers an array of information, news and analytics to facilitate financial decision-making by professionals across the global capital markets.

It includes tens of thousands of different applications, from market information displays with very low latency, to analytic calculations for financial instruments and trading solutions that oversee the whole lifecycle of a trade.

The backends of all those apps often also serve enterprise use cases via APIs.

Bloomberg’s core backend has traditionally been composed of a few monolithic binaries and they still offer some advantages. Because of this, the development team has taken a hybrid approach to integrating service meshes into its infrastructure.

Wainwright writes to explain how…

The Monolith

The mega-monoliths (referenced above) are mostly C++ and code from thousands of developers is pulled into them.

They have a weekly cycle, with release branches that spawn at set times, each the start of a new release cycle — you might call this a ‘monoschedule.’ It is, quite deliberately, not very flexible. Deployment is slow but simple. Everyone knows the social contract to get changes submitted, testing completed and any needed fixes in. It works well, because it is predictable.

To counter the inflexibility of the ‘monoschedule,’ feature toggles allow changes to be flipped without interrupting users’ sessions — per customer, if required.
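As a rough illustration of the idea, a per-customer toggle check could look like the sketch below. The names (FeatureToggles, choose_code_path, "new_quote_layout") are hypothetical and the store is a plain in-memory dictionary; nothing here describes Bloomberg's actual toggle system.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class FeatureToggles:
    """In-memory toggle store (assumption: a real system would read shared config)."""
    defaults: Dict[str, bool] = field(default_factory=dict)
    overrides: Dict[Tuple[str, str], bool] = field(default_factory=dict)

    def set_default(self, feature: str, enabled: bool) -> None:
        self.defaults[feature] = enabled

    def set_for_customer(self, feature: str, customer_id: str, enabled: bool) -> None:
        self.overrides[(feature, customer_id)] = enabled

    def is_enabled(self, feature: str, customer_id: str) -> bool:
        # A per-customer override wins; otherwise use the global default (off if unknown).
        key = (feature, customer_id)
        if key in self.overrides:
            return self.overrides[key]
        return self.defaults.get(feature, False)


def choose_code_path(toggles: FeatureToggles, customer_id: str) -> str:
    # Both code paths ship in the same weekly release; the toggle picks one at request time.
    if toggles.is_enabled("new_quote_layout", customer_id):
        return "new"
    return "legacy"
```

Because the check happens on each request, flipping a toggle changes behaviour without a redeploy or a session restart, which is the property the weekly schedule cannot offer on its own.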

Scaling mountains

We first started building service meshes almost as an afterthought. If monolithic services are so great, why change?

Our services scale very well, up to the hardware’s limits. But they can’t scale further unless we deploy new hardware. Plus, you can’t do that on a moment’s notice. Our customers need us most when market activity peaks and handling the load is exactly the point.

Monoliths mean that high-traffic functions are deployed with the same mechanisms and priority as low-traffic ones. As we added more features, the different priorities among teams meant an increasing demand for special cases and exceptions, all of which had to be tracked.

We found that some analytics scaled better if we extracted them into services and just had the monolith route traffic to them. This solved the impedance mismatch the monolith created between teams and it brought the added advantage that we could deploy changes more rapidly. Because analyses often query one another, a mesh naturally started to develop.

Complexity challenge

You might think the biggest challenge here is the migration from monolithic applications to isolated containerised microservices, but it really isn’t. Migrating a routine to a service was often easy. However, as more moving pieces were introduced, we had to invest in our tooling and our culture, too.

Replacing in-process calls with service calls introduces potential latency, timeouts and queueing issues. More subtly, service meshes are harder for engineers to reason about. These were problems they’d never had to consider before, let alone solve.
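To make the timeout concern concrete, here is a minimal sketch in which a thread pool stands in for a network client. The helper call_service and the two analytics functions are hypothetical names chosen for illustration. An in-process call simply returns; a service call needs a deadline and a decision about what to do when the deadline passes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

# Hypothetical worker pool standing in for a network client.
_pool = ThreadPoolExecutor(max_workers=4)


def call_service(fn, *args, timeout_s, fallback):
    """Run fn(*args) as if it were a remote call, giving up after timeout_s seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except CallTimeout:
        # Best effort only: a call that is already running keeps running in the background.
        future.cancel()
        return fallback


def fast_analytic(x):
    return x * 2


def slow_analytic(x):
    time.sleep(0.3)  # simulates a congested downstream service
    return x * 2
```

With a generous deadline, call_service(fast_analytic, 21, timeout_s=1.0, fallback=None) returns the computed value. With a tight one, call_service(slow_analytic, 21, timeout_s=0.05, fallback=None) returns the fallback, and the caller now has to handle a failure mode that never existed when the routine was a plain function call.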

Distributed trace

One solution to the cognitive load problem is distributed trace. This allows an engineer to see all of a service’s dependencies. Almost as important, engineers who own downstream services can see which upstream services rely on them.

A prime example is our company-wide “securities field computation service.” Calculations around the universe of securities can take place in many places. Knowing where to route requests for such computations is non-trivial, so we use a sort of smart proxy that becomes a “black box” router for requests.
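A toy version of such a router is sketched below, assuming that requests are keyed by a field name and that each field maps to exactly one handler. FieldRouter and its methods are hypothetical, and a real proxy would forward over the network rather than call a local function.

```python
from typing import Callable, Dict

# A handler takes a security identifier and returns a computed value.
Handler = Callable[[str], float]


class FieldRouter:
    """Maps a field name to whichever service computes it (illustrative only)."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def register(self, field_name: str, handler: Handler) -> None:
        self._routes[field_name] = handler

    def compute(self, field_name: str, security_id: str) -> float:
        # Callers name the field they want; only the router knows where it is computed.
        try:
            handler = self._routes[field_name]
        except KeyError:
            raise LookupError(f"no service registered for field {field_name!r}") from None
        return handler(security_id)
```

The caller never learns which service answered, and the service never learns who asked. That keeps both sides simple and is what makes the router a "black box".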

Services provide data for requests without needing to know who’s asking. Unfortunately, this presents an obstacle when there are performance problems. Distributed trace restores visibility into “who I am calling” and “who is calling me” that the monolith previously gave engineers implicitly.
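The mechanism behind that visibility can be sketched in a few lines. Each service records a span and passes a trace identifier plus its own span identifier to whatever it calls, so a collector can later rebuild the call graph. The header names, the SPANS list and the service names below are assumptions made for illustration; production systems typically rely on a standard such as the W3C Trace Context header and a dedicated tracing backend.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


# Stand-in for a trace collector (assumption: real spans are exported to a backend).
SPANS: List[Span] = []


def start_span(service: str, headers: Dict[str, str]) -> Dict[str, str]:
    """Record a span for `service` and return the headers to send downstream."""
    span = Span(
        trace_id=headers.get("trace-id") or uuid.uuid4().hex,
        span_id=uuid.uuid4().hex,
        parent_id=headers.get("parent-span-id"),
        service=service,
    )
    SPANS.append(span)
    return {"trace-id": span.trace_id, "parent-span-id": span.span_id}


def call_graph(spans: List[Span]):
    """Derive 'who I am calling' (callees) and 'who is calling me' (callers)."""
    by_id = {s.span_id: s for s in spans}
    callees, callers = defaultdict(set), defaultdict(set)
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is not None:
            callees[parent.service].add(s.service)
            callers[s.service].add(parent.service)
    return callees, callers
```

If a hypothetical "terminal-monolith" calls a "field-router", which in turn calls "bond-analytics", the callers map for "bond-analytics" contains "field-router" even though the analytics service never inspected who was asking.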

Safety at speed

To ensure engineers could focus on their applications, we built scalable platforms using open source solutions like Kubernetes, Redis, Kafka and Chef. Developers can use turnkey infrastructure for the heavy lifting and drop in their application code.

Since services work similarly, testing is easier. Instead of bundling changes on a fixed schedule, better testing enables us to make changes more rapidly. Changes are smaller, so there’s less that can go wrong and mitigation is simpler.

This is often framed in technical terms — defect rates, error budgets and the like. But the real benefit is psychological. Teams confident that their tools have their back will make more progress in a shorter time. That means we deliver more value to our customers faster.

Hybrid

Porting everything from a monolithic architecture is expensive, risky and also unnecessary. For functions that lack scalability concerns, there’s no reason to convert to a service-oriented architecture just because it’s trendy. We still have the monoliths and their weekly cycle. But they’ve evolved into services themselves.

The hybrid approach has worked well for us because we migrate only when the benefit justifies the increased complexity. We addressed engineers’ increased cognitive burden by investing in better visibility. In the end, it’s not about scaling our applications. It’s about scaling our engineers.
