Testing a four-million-user disaster recovery plan

Preparing broadcaster MTG's websites for the Winter Olympics won Load Impact the 2015 Computer Weekly European Private Sector Award

This article can also be found in the Premium Editorial Download: CW Europe – September 2015 edition.

In 2011, MTG acquired the rights to broadcast the 2014 Winter Olympics throughout Sweden. As that country is quite competitive in winter sports, MTG expected a significant amount of traffic accessing its various sites as well as the live stream offered on os.viaplay.se.

Load Impact, which provides software as a service for load testing combined with professional services, was commissioned to perform a series of load and performance tests aimed at executing several "worst-case scenarios" ahead of the Olympics.

Load testing is increasingly important to ensure web applications remain running as website visits increase. Load Impact was hired by the broadcaster's digital accelerator, MTGx, to keep its websites running during the Olympics. It was contracted to identify the bottlenecks, fix them and then re-test to confirm the improvements.

The company’s task was to ensure everything bar streaming would continue to operate in the event of a worst-case scenario, testing authentication and authorisation. Michael Sjölin, head of professional services at Load Impact, says: "Normally this would not have been a huge task, but MTGx had a disaster recovery plan where the web would be the only channel for viewers if everything else failed."

Sjölin says the absolute worst case would have been a failure during the Sweden versus Canada ice hockey final, with everyone switching to watch it online.

The worst-case scenarios were:

  • Score is level during the final minutes of the hockey final (Sweden vs Canada).
  • Terrestrial broadcast is interrupted and, in a panic, users flood to tv3.se within a few minutes to find where they can stream the event online (os.viaplay.se).

"In the worst case, there could be four million users," says Sjölin. In common with most modern web-based systems, MTGx’s websites had many connections to back-end systems, so Load Impact was also required to test integration.

Among the problems Sjölin faced was the aggressiveness of the testing MTGx required for disaster recovery. "We needed to ramp up a million users in a couple minutes, which is rather drastic and more aggressive than the Superbowl," he says.

Ramping up to this level of usage is not simple to do in the cloud, so preparation was key. First, Load Impact needed to work with MTGx to identify the data points that would be measured. "For each request, we collected six data points, multiplied by hundreds of requests per user," says Sjölin.

"We were able to scale up very rapidly. We had to preload the load generators manually to ensure they were fast enough."
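
To give a sense of the scale involved, the ramp-up can be sketched in a few lines of Python. This is a rough illustration, not Load Impact's tooling: it assumes "a couple of minutes" means 120 seconds, a linear ramp, and 200 requests per user as a stand-in for "hundreds". Only the one million users and six data points per request come from Sjölin's figures.

```python
def ramp_schedule(target_users, ramp_seconds, step_seconds):
    """Return (elapsed_seconds, active_users) pairs for a linear ramp-up."""
    steps = ramp_seconds // step_seconds
    return [
        (i * step_seconds, target_users * i // steps)
        for i in range(steps + 1)
    ]

TARGET_USERS = 1_000_000        # figure quoted by Sjölin
RAMP_SECONDS = 120              # assumption: "a couple of minutes"
REQUESTS_PER_USER = 200         # assumption: stand-in for "hundreds"
DATA_POINTS_PER_REQUEST = 6     # figure quoted by Sjölin

schedule = ramp_schedule(TARGET_USERS, RAMP_SECONDS, step_seconds=30)
new_users_per_second = TARGET_USERS / RAMP_SECONDS
total_data_points = TARGET_USERS * REQUESTS_PER_USER * DATA_POINTS_PER_REQUEST

for elapsed, users in schedule:
    print(f"t={elapsed:>3}s  active users={users:,}")
print(f"new users per second: {new_users_per_second:,.0f}")
print(f"data points to collect: {total_data_points:,}")
```

Under these assumptions, the load generators would need to start more than 8,000 new simulated users every second, and the test would produce over a billion measurements, which helps explain why the generators had to be preloaded.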

Common website optimisation problems

Websites usually have enough bandwidth, says Michael Sjölin, head of professional services at Load Impact. Often, however, bandwidth is limited somewhere along the path by the network configuration. This was among the issues Load Impact identified with MTGx’s sites.

Sjölin says: "The most common problem with bandwidth is where you have 10Gbps networks attached to the servers but you have a 100Mbps firewall, which then limits bandwidth to 100Mbps."
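
The principle Sjölin describes is that end-to-end throughput is capped by the slowest hop. A minimal Python sketch makes the point; the hop names and the switch capacity are hypothetical, and only the 10Gbps and 100Mbps figures come from the quote.

```python
def find_bottleneck(hops_mbps):
    """Return the (name, capacity) of the slowest hop on a network path."""
    return min(hops_mbps.items(), key=lambda item: item[1])

# Hypothetical path between the internet and the web servers.
path = {
    "server_nic": 10_000,    # 10Gbps, from the quote
    "core_switch": 10_000,   # assumption for illustration
    "firewall": 100,         # 100Mbps, from the quote
}

bottleneck, capacity = find_bottleneck(path)
print(f"Effective bandwidth: {capacity}Mbps, limited by {bottleneck}")
```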

From a tuning and optimisation perspective, the CPU can hamper web performance, he says, depending on the type of architecture – the bottleneck could exist on the front-end web servers or on the application servers. "CPU performance limits how quickly pages are served or rendered. The problem is often processing- or business logic-related," says Sjölin.

The third area to consider is the database. In Sjölin’s experience, problems here are usually down to configuration issues, such as limits on the number of concurrent queries or the frequency of updating indices. "Most often you see a queue of database requests," he says. "This is often a database configuration issue."

Using the Load Impact software and with support from Load Impact test consultants, the MTGx IT team simulated the average user load on the tv3.se homepage and then executed a sudden usage spike before users clicked the link to the Olympics website, os.viaplay.se.

Cache flow

During the first tests of both websites, Load Impact discovered a bug in the Sotchijunior site, says Sjölin. "Just as you don’t print a newspaper on demand, normally you do not want to produce static content on the fly. Serving content is cheap, but generating is expensive from a resources perspective."

Website optimisation normally involves deploying a cache to push static content through more quickly. But, clearly, not everything can be cached. For instance, live scores require fresh data.

On the Sotchijunior website, static content was being created on demand. "We needed to put it in a cache and balance content remaining current with the website being highly acceptable from a user perspective," says Sjölin.

MTGx’s web hosting provider, BaseFarm, then tweaked the Varnish caching platform to optimise the length of time it stored content, duration of the live connection and how long it allowed browser clients to cache content. Given the huge number of users on the MTGx sites, the optimisations affected thousands of users.
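
The trade-off between freshness and cost can be illustrated with a minimal time-to-live cache in Python. This is a sketch of the general idea, not BaseFarm's Varnish configuration: the `TTLCache` class, the `/scores` path and the 10-second lifetime are all hypothetical.

```python
import time

class TTLCache:
    """Minimal time-to-live cache: regenerate content only when it is stale."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.store = {}      # path -> (expires_at, body)
        self.generated = 0   # how many times the expensive path ran

    def get(self, path, generate):
        now = self.clock()
        entry = self.store.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]            # cheap: serve the stored copy
        body = generate(path)          # expensive: build the page
        self.generated += 1
        self.store[path] = (now + self.ttl_seconds, body)
        return body

def render(path):
    """Stand-in for expensive page generation."""
    return f"<html>{path}</html>"

# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: fake_now[0])

for _ in range(1000):                  # 1,000 requests inside the lifetime
    cache.get("/scores", render)
first_burst = cache.generated

fake_now[0] = 11.0                     # content has now expired
cache.get("/scores", render)
after_expiry = cache.generated

print(first_burst, after_expiry)
```

In this sketch a thousand requests trigger a single page generation, and the page is rebuilt only once its lifetime has passed. A shorter lifetime keeps content such as live scores more current at the price of more regeneration work.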

After removing the requests that went directly through the caching layer, Load Impact re-ran the test on Sotchijunior. The test results showed that the site could handle more than twice the load of the previous configuration, rising from 7,500 to 20,000 concurrent users.

Compression optimisation

The other issue Load Impact identified was with compression, which is often used tactically to improve the speed of downloading web pages.

Through testing, MTGx's IT team discovered a problem that meant all the content was delivered from the CDN to the Varnish cache in plain text rather than Gzipped. This meant far more bandwidth was consumed during communication between the Varnish cache and the CDN.

"Almost all content really benefits from being compressed with Gzip," says Sjölin. "Usually you can allow the content delivery network to compress, but this may waste CPU resources. It is a balance. We knew upfront there were plenty of CPU resources left, so we switched on compression."
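
The bandwidth saving from Gzip is easy to demonstrate with Python's standard library. The page body below is hypothetical sample markup, and the compression level is an arbitrary choice; real savings depend on the content being served.

```python
import gzip

# Hypothetical page body: repetitive markup of the kind that compresses well.
page = ("<li class='result'>sample result row</li>\n" * 500).encode("utf-8")

# Higher compression levels shrink the payload further but cost more CPU time.
compressed = gzip.compress(page, compresslevel=6)
ratio = len(compressed) / len(page)

print(f"plain: {len(page):,} bytes")
print(f"gzip:  {len(compressed):,} bytes ({ratio:.1%} of original)")

# The client recovers the original content exactly.
restored = gzip.decompress(compressed)
```

The compression level is where the balance Sjölin mentions shows up: each step up trades processor time for fewer bytes on the wire.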

Overall, the performance testing conducted by Load Impact helped MTGx's IT team improve its system performance in advance of the Winter Olympics. Without proper testing, the event could have caused a system failure, preventing many thousands of Swedes from watching their country compete for gold.

This was last published in August 2015
