Online bookies need to cope with the usual peaks and troughs around football, horse racing, motorsports and other major sporting events, but few seem to attract the betting level of the Grand National, and if a site is overloaded or taking too long to respond, punters will take their business elsewhere.
People who don’t normally place bets are often unsure whether they have won anything or how much they have won, and if they are indeed lucky, they may well want to place a bet on something else running that day, such as a Premier League match.
All of this activity puts a huge amount of stress on the IT systems running online betting sites. Monitoring numerous back-end systems, data feeds and the user experience is essential to maintaining site reliability and ensuring people are able to place the bets they want.
Stephen Wild, observability manager at William Hill, runs a team of 10 that looks after everything going on with the IT at William Hill. “With observability, we can keep an eye on all our services,” he says. “To support this, the company chose New Relic as its observability platform.”
William Hill previously monitored the individual nodes that comprised its software stack. This, says Wild, is monolithic monitoring. The bookie has been on a journey to migrate workloads to the cloud, in a strategy to modernise IT infrastructure that was not able to cope well with the huge peak in people placing bets during major sporting events like the Grand National. “Every betting site is hit hard by the Grand National, because we get people who don’t normally bet, and if they don’t get feedback from the website, they go somewhere else,” he says.
The challenge for Wild and the observability team is how to tackle failures that only occur during peak betting periods. “In the past, it was a bit of a nightmare because we had infrastructure that wasn’t really built for the single huge day or huge week that we have,” he says. “It was built to handle the load over a year, which meant we seriously struggled with IT infrastructure that was collapsing around us.” This meant it was hard to pinpoint where failures were occurring.
William Hill no longer monitors individual nodes. The company has taken two years to migrate from physical machines to the cloud as part of its digital transformation strategy, and this has involved a change to how it does observability. “Our old monitoring platform wasn’t really doing what we wanted it to do, and wasn’t keeping up with our journey into the cloud,” says Wild.
Understanding the cost of performance degradation
Understanding the revenue impact of technical outages across all production business services is a key objective in William Hill’s observability strategy. To help teams gain the real-time observability needed to achieve this, the observability team built a tool called Impact Listener on top of New Relic, which William Hill uses to track high-priority “P1” incidents. The tool can be mapped onto any business service and any metric in real time to provide context and insights into service-impacting incidents throughout the incident lifecycle.
New Relic is the primary trigger to launch the Impact Listener workflow. Alerts for critical incidents are sent to PagerDuty.
“Impact Listener lets us prioritise what needs fixing first,” says Wild. “It shows where most of the revenue is being lost. There is an urgency to fix the problem that is costing us the most money.”
He adds that, thanks to Impact Listener, William Hill can now resolve 80% of P1 issues within one hour.
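The prioritisation Wild describes, fixing first whatever is losing the most money, can be illustrated with a minimal Python sketch. This is not William Hill's Impact Listener or any New Relic API: the `Incident` class, the service names and the revenue figures are all hypothetical, and the sketch simply assumes that each affected service has a known baseline revenue rate to compare against the rate observed during the incident.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A hypothetical P1 incident tied to one business service."""
    service: str
    baseline_revenue_per_min: float  # assumed normal revenue rate (invented figure)
    current_revenue_per_min: float   # assumed rate observed during the incident

    @property
    def loss_per_min(self) -> float:
        # Estimated revenue lost per minute; never negative.
        return max(0.0, self.baseline_revenue_per_min - self.current_revenue_per_min)


def prioritise(incidents):
    """Return incidents ordered by estimated revenue loss, largest first."""
    return sorted(incidents, key=lambda i: i.loss_per_min, reverse=True)


# Illustrative data only; none of these names or numbers come from William Hill.
incidents = [
    Incident("horse-racing-bets", 1000.0, 400.0),
    Incident("football-in-play", 800.0, 750.0),
    Incident("account-login", 500.0, 0.0),
]

ranked = prioritise(incidents)
for incident in ranked:
    print(f"{incident.service}: losing about {incident.loss_per_min:.0f} per minute")
```

In this sketch, the service with the largest gap between expected and observed revenue is ranked first, which mirrors the idea of tackling the costliest problem before the others.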
William Hill now uses New Relic as its observability platform. The service was selected following an extensive, three-month evaluation of the leading providers of observability tools, as rated in Gartner’s Magic Quadrant report for observability platforms.
One of the interesting features of Gartner’s Magic Quadrant is the “Visionary” quadrant, which shows where new and emerging technology features are heading. These innovations tend to be developed by companies that are generally not known for their depth, breadth and market reach in the technology segment covered by Gartner’s analysis.
Even though New Relic may not have all the bells and whistles offered by those classified as visionary, Wild believes that established observability platforms invariably catch up relatively quickly.
“I’m not saying these features are copied because, let’s face it, a lot of what’s in an observability platform is a similar set of interfaces, but New Relic is willing to latch onto a requirement,” he says. “If we put in a feature request, for example, New Relic will take it seriously and invariably answer it very well.”
Real user monitoring in New Relic is used to understand the end user experience. “Users expect instant response even though it may only be a half-second delay on the site,” says Wild. “They’ll just not use the tech anymore and go somewhere else that’s quicker.”
When the Grand National was run as a virtual race during the Covid-19 pandemic, he says William Hill gained valuable insights into how the systems deployed for digital transformation and the migration to the cloud would cope. For Wild, the virtual Grand National demonstrated the resiliency of William Hill’s IT.