API series - Section: The why & how of distributing GraphQL

This is a contributed piece for the Computer Weekly Developer Network written by Daniel Bartholomew, CTO at Section.

Section is known for hosting and delivery of cloud-native workloads that are highly distributed and continuously optimised across a secure and reliable global infrastructure. Bartholomew is a regular speaker at industry events and experienced technologist in agile and containerised development.

His current role is to envision the technology organisations need to simplify and automate global delivery of cloud-native workloads.

Bartholomew writes as follows…

Sources such as Cloudflare note that API calls are the fastest-growing type of Internet traffic and GraphQL APIs are rapidly becoming a de-facto way that companies interact with data. While REST APIs still dominate, GraphQL has a significant advantage: it prioritises giving clients exactly the data they request… and nothing more.

As part of that, it can combine results from multiple sources – including databases and APIs – into a single response.
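
To make that concrete, here is a minimal sketch in Python rather than a real GraphQL server. The two backends, the RESOLVERS table and the execute function are all hypothetical names invented for illustration; the point is only that the client names the fields it wants, and a single response is assembled from more than one source and trimmed to that selection.

```python
from typing import Any, Callable, Dict, List

# Hypothetical backends: a "database" lookup and a separate "REST API" call.
def fetch_user_from_db(user_id: int) -> Dict[str, Any]:
    return {"id": user_id, "name": "Ada", "email": "ada@example.com", "plan": "pro"}

def fetch_orders_from_api(user_id: int) -> List[Dict[str, Any]]:
    return [{"id": 101, "total": 42.0, "status": "shipped"}]

# Each top-level field maps to the source that can resolve it.
RESOLVERS: Dict[str, Callable[[int], Any]] = {
    "user": fetch_user_from_db,
    "orders": fetch_orders_from_api,
}

def project(value: Any, fields: List[str]) -> Any:
    """Keep only the requested fields of a dict or of each dict in a list."""
    if isinstance(value, list):
        return [project(item, fields) for item in value]
    return {name: value[name] for name in fields}

def execute(selection: Dict[str, List[str]], user_id: int) -> Dict[str, Any]:
    """Resolve each requested top-level field and trim it to the selection."""
    return {
        field: project(RESOLVERS[field](user_id), subfields)
        for field, subfields in selection.items()
    }

# The client asks for the user's name and each order's id and total, nothing more.
result = execute({"user": ["name"], "orders": ["id", "total"]}, user_id=1)
print(result)  # {'user': {'name': 'Ada'}, 'orders': [{'id': 101, 'total': 42.0}]}
```

The email, plan and status fields never leave the server, which is where the bandwidth saving comes from.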

In short, it’s more efficient, which can significantly reduce bandwidth usage and improve application responsiveness, and thereby both cost and performance.

However, the structure of GraphQL means that caching responses for improved performance can be a significant challenge. The secret to making GraphQL more efficient is therefore to distribute GraphQL API servers so that they operate (only and always) closer to end users, where and when needed.

A go-to strategy for go

Distributing application workloads is a go-to strategy to improve performance, reliability, security and a host of other factors.

When looking at API servers in particular, distribution results in high performance and reliability for the end user, lower costs for backend hosting, lower impact on backend servers, better ability to handle spikes, better security, cloud independence and (if done correctly) no impact on your development and management processes.

This last point is key, as deploying multi-cloud API services has historically been a largely manual process. But before we get to the ‘how’, let’s dig a bit deeper into ‘why’ you would want to distribute GraphQL servers.

Why Distribute GraphQL?

The performance angle is straightforward: reducing last-mile distance lowers latency and considerably improves responsiveness. Users will experience this directly as a performance boost. In managing the network, you can control how broadly GraphQL servers are distributed, thereby balancing and tailoring performance and cost.

The cost factor is impacted by, among other things, data egress. API servers specifically, and microservice architectures in general, are designed to be very ‘chatty’.

When using a hyperscaler for cloud hosting, those data egress costs quickly add up. While there’s a lot that can be done to optimise and right-size the capacity and resource requirements, it’s incredibly difficult to optimise egress cost. Distributing GraphQL servers outside the hyperscaler environment (and potentially adding distributed caching with the solution) can minimise these traffic costs.

There are several aspects to decreasing the impact on backend services and the way in which the development teams operate.

Some are inherent to GraphQL: for instance, versioning is no longer an issue.

Without GraphQL, you have to be careful about versioning and updating APIs. With GraphQL as a proxy, you have flexibility. The GraphQL endpoint can remain the same even if the backend changes. Frontend and backend teams thus become more loosely connected, meaning they can operate at different paces, without blocking, so business moves faster. A given frontend can also have a single endpoint dedicated to it, called ‘Backend For Frontend’ (BFF), which further improves efficiency.
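
As a rough sketch of that decoupling, assume two hypothetical backend versions that return differently shaped payloads. The function and field names below are invented for illustration. A small adapter per backend keeps the shape the frontend sees unchanged, so the backend can be swapped without the client noticing.

```python
from typing import Any, Callable, Dict

# Hypothetical backend versions with different payload shapes.
def backend_v1(product_id: int) -> Dict[str, Any]:
    return {"id": product_id, "title": "Kettle", "price_pence": 2499}

def backend_v2(product_id: int) -> Dict[str, Any]:
    return {"productId": product_id, "name": "Kettle", "price": {"amount_pence": 2499}}

# Adapters translate each backend shape into the single shape clients see.
def adapt_v1(raw: Dict[str, Any]) -> Dict[str, Any]:
    return {"id": raw["id"], "name": raw["title"], "price_pence": raw["price_pence"]}

def adapt_v2(raw: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "id": raw["productId"],
        "name": raw["name"],
        "price_pence": raw["price"]["amount_pence"],
    }

def make_product_resolver(
    backend: Callable[[int], Dict[str, Any]],
    adapt: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Callable[[int], Dict[str, Any]]:
    """Build the resolver behind the (unchanging) 'product' field."""
    def resolve_product(product_id: int) -> Dict[str, Any]:
        return adapt(backend(product_id))
    return resolve_product

# Swapping the backend changes the wiring, not what the frontend receives.
resolver_before = make_product_resolver(backend_v1, adapt_v1)
resolver_after = make_product_resolver(backend_v2, adapt_v2)
print(resolver_before(7) == resolver_after(7))  # True
```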

If caching is employed along with distribution, the load that traffic places on backend services decreases, as API results themselves can be captured and stored for reuse. Distributed API caching, done well, greatly reduces the need to distribute the database itself and again cuts down on cost.

Challenges in distributing GraphQL

However, there are challenges with GraphQL when trying to connect data across a distributed architecture, particularly with caching.

With GraphQL, since you are using just one HTTP request, you need a structure to say, “I need this information,” so the request has to carry a body. You don’t typically send bodies from the client to the server with GET requests, but rather with POST requests, which have historically been used for operations such as authentication rather than for cacheable reads. This means a caching solution such as Varnish Cache cannot key on what is being asked for, because these reverse proxies typically do not analyse POST bodies.

This problem has led to comments like ‘GraphQL breaks caching’ or ‘GraphQL is not cacheable’.

While it is more nuanced than this, GraphQL presents three main caching issues:

Duplicate Cache – When the same value is stored under more than one key (e.g. when more than one URL leads to the same response or cache value).
Overlapping Cache – This happens when, due to aggregation, sections of different responses are almost identical. Instead of each underlying API being cached atomically, the whole GraphQL API call is cached, which takes longer than independent and asynchronous calls to each API.
Cache Times – Setting the cache expiry time (or TTL) becomes challenging, because portions of a GraphQL response become stale at different rates, which hurts the cache-hit ratio.

CDNs are unable to solve this natively without altering their architecture. Some CDNs have created a workaround of changing POST requests to GET requests, which places the POST body of the GraphQL request in the URL, where it then gets normalised. However, this workaround is insufficient, because it means you can only cache full responses.
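
The normalisation step in that workaround can be sketched as follows. This is an illustration under stated assumptions, not how any particular CDN implements it: normalise_query, cache_key and as_get_url are hypothetical helpers, the endpoint URL is a placeholder, and collapsing whitespace is a simplification (a real normaliser would parse the query, since this approach would also alter whitespace inside string literals).

```python
import hashlib
import json
import re
from typing import Any, Dict, Optional
from urllib.parse import urlencode

def normalise_query(query: str) -> str:
    """Collapse runs of whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", query).strip()

def canonical_variables(variables: Optional[Dict[str, Any]]) -> str:
    """Serialise variables with sorted keys so key order does not matter."""
    return json.dumps(variables or {}, sort_keys=True, separators=(",", ":"))

def cache_key(query: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Hash the normalised query together with the canonical variables."""
    payload = normalise_query(query) + "\n" + canonical_variables(variables)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def as_get_url(endpoint: str, query: str,
               variables: Optional[Dict[str, Any]] = None) -> str:
    """Rewrite what would be a POST body as a cacheable GET URL."""
    params = {
        "query": normalise_query(query),
        "variables": canonical_variables(variables),
    }
    return f"{endpoint}?{urlencode(params)}"

compact = "{ user(id: 1) { name } }"
pretty = "{\n  user(id: 1) {\n    name\n  }\n}"

# Two differently formatted copies of the same query share one cache entry.
print(cache_key(compact) == cache_key(pretty))  # True
print(as_get_url("https://api.example.com/graphql", pretty))
```

Note that the key identifies the whole query, so two queries that overlap in only some fields still produce two separate full-response entries.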

For the best performance, we want to be able to cache only certain parts of the response and then stitch them together. Furthermore, terminating SSL and unwrapping the body to normalise it can also introduce security vulnerabilities and operational overhead.
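
One way to picture partial caching is a cache keyed per field rather than per response, with each field carrying its own TTL. The sketch below is a simplified illustration: FieldCache, resolve, the field names and the TTL values are all hypothetical, and a fake clock is used so that the behaviour is deterministic.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

class FieldCache:
    """Caches each top-level field separately, each with its own TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # field -> (expiry, value)

    def lookup(self, field: str) -> Tuple[bool, Any]:
        """Return (True, value) on a fresh hit, otherwise (False, None)."""
        entry = self._entries.get(field)
        if entry is None or entry[0] <= self._clock():
            return False, None
        return True, entry[1]

    def put(self, field: str, value: Any, ttl_seconds: float) -> None:
        self._entries[field] = (self._clock() + ttl_seconds, value)

def resolve(fields: List[str], cache: FieldCache,
            fetchers: Dict[str, Callable[[], Any]],
            ttls: Dict[str, float]) -> Dict[str, Any]:
    """Serve fresh fields from cache, fetch the rest, stitch one response."""
    response: Dict[str, Any] = {}
    for field in fields:
        hit, value = cache.lookup(field)
        if not hit:
            value = fetchers[field]()
            cache.put(field, value, ttls[field])
        response[field] = value
    return response

# Hypothetical fields: a slow-changing headline and a fast-changing price.
now = [0.0]
calls = {"headline": 0, "price": 0}

def fetch_headline() -> str:
    calls["headline"] += 1
    return "Launch day"

def fetch_price() -> int:
    calls["price"] += 1
    return 100 + calls["price"]

cache = FieldCache(clock=lambda: now[0])
fetchers = {"headline": fetch_headline, "price": fetch_price}
ttls = {"headline": 300.0, "price": 5.0}

first = resolve(["headline", "price"], cache, fetchers, ttls)
now[0] = 10.0  # the price entry has expired, the headline has not
second = resolve(["headline", "price"], cache, fetchers, ttls)
print(first, second, calls)
```

On the second request only the price is fetched again, while the headline is served from cache and stitched into the same response.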

GraphQL becomes more performant by using distribution to store and serve responses closer to the end user. It is also an effective way to minimise the number of API requests that reach the origin.

This way, you can deliver a cached result much more quickly than doing a full roundtrip to the origin. You also save on server load as the query doesn’t actually hit your API. If your application doesn’t have a great deal of frequently-changing or private data, it may not be necessary to utilise edge caching, but for applications with high volumes of public data that are constantly updating, such as publishing or media, it’s essential.

While there are multiple benefits to distributing GraphQL servers, getting there is typically not easy as it requires a team to take on the burden of managing a distributed network. Issues like load balancing/shedding, DNS, TLS, BGP/IP address management, DDoS protection, observability and other networking and security requirements become front and centre. At a more basic level, how do you manually manage, orchestrate and optimise potentially hundreds of GraphQL servers?

These are the types of issues that have led to the rise of distributed hosting providers. The best of these use automation to take on the burden of orchestration and optimisation, allowing organisations to focus on application development and not API delivery. That said, there are specific considerations when it comes to GraphQL.

First, it will be necessary to host GraphQL containers themselves, not just API functionalities, thus eliminating Function as a Service (FaaS) as a distribution strategy. Moreover, it will be necessary to run other containers alongside the GraphQL server to handle caching, security, etc.

Cashing in on unlimited concurrency

Ideally, you also want to ensure scalability through unlimited concurrency, enabling the distributed GraphQL servers to support a large number of concurrent connections – exceeding the source database connection limit.
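
A minimal sketch of one way to serve more concurrent clients than the database has connections, assuming Python's asyncio: a semaphore caps the number of simultaneous backend calls, and identical in-flight requests share a single result. CoalescingGate, query_db and the limit of two connections are hypothetical and chosen only for illustration; this is not a description of any provider's implementation.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

class CoalescingGate:
    """Caps concurrent backend calls and shares in-flight results by key."""

    def __init__(self, max_connections: int) -> None:
        self._slots = asyncio.Semaphore(max_connections)
        self._in_flight: Dict[str, "asyncio.Future[Any]"] = {}

    async def fetch(self, key: str, load: Callable[[], Awaitable[Any]]) -> Any:
        existing = self._in_flight.get(key)
        if existing is not None:
            return await existing  # join the request already in progress
        task = asyncio.ensure_future(self._guarded(load))
        self._in_flight[key] = task
        try:
            return await task
        finally:
            self._in_flight.pop(key, None)

    async def _guarded(self, load: Callable[[], Awaitable[Any]]) -> Any:
        async with self._slots:  # wait for a free backend connection
            return await load()

async def main() -> Tuple[List[int], int, int]:
    gate = CoalescingGate(max_connections=2)  # hypothetical database limit
    active = 0
    peak = 0
    backend_calls = 0

    async def query_db(n: int) -> int:
        nonlocal active, peak, backend_calls
        backend_calls += 1
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a real database round trip
        active -= 1
        return n * n

    # 100 concurrent client requests, but only 5 distinct queries.
    results = await asyncio.gather(
        *(gate.fetch(f"q{i % 5}", lambda n=i % 5: query_db(n)) for i in range(100))
    )
    return list(results), peak, backend_calls

results, peak, backend_calls = asyncio.run(main())
print(len(results), peak, backend_calls)
```

In this run the hundred client requests are answered by five backend calls, and no more than two of those are ever open at once.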

In the end, whether you roll your own solution, or use one of the cloud-native hosting providers, distributing GraphQL API servers and other compute resources will significantly improve both the user experience and the overall cost and robustness of application services. In short, it makes all the sense in the world for developers.