How the Covid-19 Genomics UK Consortium sequenced Sars-Cov-2

Consortium of universities and other institutions has harnessed datasets, analytics and cloud computing to sequence Sars-Cov-2, the virus that causes Covid-19, in a blisteringly short time

Genomics, the study of genes, is a field of biology that relies on computing. While the ability to sequence – effectively, read – the human genome has gained much attention, researchers have been quietly working to use the same techniques to track and analyse diseases. This work stepped into the limelight in 2020 by focusing on Sars-Cov-2, the virus that causes Covid-19.

The UK’s work on this has taken place through the Covid-19 Genomics UK Consortium (Cog-UK), which as of 12 April 2021 had sequenced 428,056 samples.

Data from global repository Gisaid suggests that only the US has come close to this. Emma Hodcroft, a molecular epidemiologist at the University of Bern in Switzerland, described the UK’s sequencing work to the New York Times as “the moonshot of the pandemic”.

Genomic sequencing of viruses lets researchers track mutations as the viruses reproduce, so that authorities can change strategies accordingly. The B117 variant of Sars-Cov-2, which is more transmissible than earlier strains, was first sequenced in September 2020 and formally identified as being of concern by Public Health England in December, contributing to the lockdown that month. Within the UK, B117 is often called the Kent variant, although other countries tend to call it the UK or British variant.

Origins of Cog-UK

Cog-UK was set up quickly, but it relies on technology and expertise developed over the years. Following a request from the UK government’s chief scientific adviser, Patrick Vallance, and a series of emails and phone calls, a group of about 20 people met at the Wellcome Trust in London on 11 March 2020.

“Most of the objectives and framework for Cog-UK were negotiated by the end of the meeting,” writes Sharon Peacock, professor of public health and microbiology at the University of Cambridge and executive director of the consortium.

The previous largest genomic viral dataset, from the Ebola outbreak in west Africa in 2014-16, contained about 1,500 samples. “Cog-UK surpassed this total within the first month and has continued to push viral genome surveillance on to an entirely different scale ever since,” says Peacock. The project launched with £20m of UK government funding on 23 March 2020.

Peacock describes Cog-UK as “a coalition of the willing” involving the UK government, the UK’s four public health agencies and a range of academic, NHS and public health organisations. Through 16 hubs, members sequence positive samples from people with Covid-19, with the Wellcome Sanger Institute in Cambridgeshire – which co-led the first sequencing of the human genome two decades ago – acting as the central sequencing hub.

The institute built on its previous work with malaria genomics to set up a highly automated pipeline process for Sars-Cov-2 that involves standardised file formats, quality control checks and editing to remove parts of the sequencing that are not required.

The institute runs its own datacentre, effectively a flexible private cloud with high-performance compute and storage. Peter Clapham, team leader for the high-performance computing (HPC) informatics support group, says a lot of the institute’s work involves large projects, including the UK Biobank, which tracks genomic and health data on 500,000 people, and the Tree of Life project, which aims to sequence DNA from all 70,000 organisms with a nucleus in the British Isles.

“We designed very early on a flexible system with our informatics customers that would allow us to adapt to what is needed,” says Clapham. For Cog-UK, it repurposed existing technology infrastructure rather than buying new equipment. “This has been a really good confirmation of the hybrid nature of what we’ve got, the flexibility we’ve managed to maintain and develop,” he adds.

Cloud infrastructure

Although the sequencing work is distributed, Cog-UK needed a central computing platform to hold the resulting data and allow analysis. Thomas Connor, professor in Cardiff University’s school of biosciences, attended the 11 March meeting with his colleague Nick Loman, professor of microbial genomics and bioinformatics at the University of Birmingham. Their universities, along with Swansea and Warwick, have collaborated on the Cloud Infrastructure for Microbial Bioinformatics (Climb) since 2014.

Climb provides microbiologists with the computing power, storage and tools required to carry out analysis of genomic data, with both Cardiff and Birmingham universities having between 3,000 and 4,000 virtual CPUs available to support research using open source software including OpenStack for cloud computing and Ceph for storage. “It’s probably the largest dedicated system for microbiology of its type in the world,” says Connor.

For Cog-UK, Connor, Loman and colleagues set up Climb-Covid, a walled garden within Climb’s existing systems at Birmingham and Cardiff universities’ on-premise datacentres. This took about three days and uses only a small fraction of Climb’s capacity with research on other pathogens continuing.

“This is the advantage of having a cloud to play on,” says Connor, adding that the project has had a different impact on his own capacity. “My last year has been Covid.”

With 30,000 base pairs – effectively bits of genomic information – Sars-Cov-2 is a minnow compared with the 3.1 billion in human DNA. But the three sequencing machines used by Public Health Wales process genomes in blocks of just 400 base pairs, producing up to 120GB of data a day.

“The computational challenge is taking that jigsaw and rebuilding it,” says Connor, who also works for the Welsh agency. The system also needs to handle metadata, including demographic details, location and information on how the sample was processed, and it has to do this quickly for it to be useful.
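The jigsaw Connor describes can be illustrated with a toy example. The sketch below is not Cog-UK's or Public Health Wales's pipeline, which the article does not detail; it assumes short reads are placed against a known reference genome by simple mismatch counting, and a consensus is then built by majority vote at each position. The function names (min_fragments, place_read, consensus) are hypothetical, and real pipelines rely on dedicated alignment tools rather than a brute-force search like this.

```python
from collections import Counter


def min_fragments(genome_length: int, read_length: int) -> int:
    """Smallest number of non-overlapping reads needed to cover a genome once."""
    return (genome_length + read_length - 1) // read_length


def place_read(reference: str, read: str) -> int:
    """Return the reference offset where the read has the fewest mismatches.

    Brute force: every possible offset is tried, and the first best one wins.
    """
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset


def consensus(reference: str, reads: list[str]) -> str:
    """Rebuild a sample genome by majority vote over the placed reads.

    Positions that no read covers are reported as 'N' (unknown).
    """
    votes = [Counter() for _ in reference]
    for read in reads:
        offset = place_read(reference, read)
        for i, base in enumerate(read):
            votes[offset + i][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)
```

On the figures quoted above, min_fragments(30_000, 400) gives 75, the minimum number of 400-base-pair blocks needed to cover the virus's genome once. In practice each position is read many times over, which is what makes the majority vote meaningful and helps explain the daily data volumes.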

Public Health Wales typically processes samples in five days, rather than the months that would be normal for scientific research.

This is easier to do in Wales than in England. The country sequences Sars-Cov-2 from about two-thirds of positive lab-processed tests for Covid-19, discarding those with low levels of the virus because they are less likely to be viable. The Welsh NHS is more centralised than England’s, with a single laboratory information management system for pathology, making it easier to gather metadata.

“We can do things very rapidly here,” says Connor. “In England, things are a little more fragmented. Climb is providing a way to integrate that data.”

The two universities used Cog-UK funding to buy solid-state drives (SSDs) to increase Climb’s speed, bringing its storage capacity to 1.5PB of SSD and 2.8PB of disk. Connor says he is grateful for the way in which Cardiff’s supplier Dell and Birmingham’s supplier Lenovo rushed new equipment to them, as well as the support of HPC colleagues Simon Thompson at Birmingham and Christine Kitchen and Martyn Guest at Cardiff.

Repurposing existing work

As with generating and storing the genomic data, repurposing existing work is key to Cog-UK’s software-based analysis. David Aanensen, professor and senior group leader in genomic surveillance at the University of Oxford’s Big Data Institute, is also director of the Centre for Genomic Pathogen Surveillance, which is based at the Big Data Institute and the Wellcome Genome Campus, also the home of the Wellcome Sanger Institute.

The centre, founded in 2015, already had its software widely used to gather and analyse genomic data on diseases in poorer countries.

Aanensen and his team started working on Covid-19 as early as January 2020, mostly using existing funding as well as grants from the National Institute for Health Research. “All the partners have volunteered time and leveraged existing infrastructure and grants,” he says of Cog-UK.

Two of the centre’s existing software packages, Data-flo and Microreact, have been used extensively by Cog-UK partners. There are local instances of Data-flo, which manages epidemiological data pipelines, at Public Health Wales and Health Protection Scotland. These allow the agencies to use the open source software to link and visualise genomic data with personal and commercial information, including patient records and names of care homes.

Microreact, developed over the last five years with Wellcome funding to visualise and share data on genomic epidemiology, has been particularly widely used. The centre has installed local instances for Public Health Wales and Health Protection Scotland, but also the US Centres for Disease Control and Prevention and the European Centre for Disease Prevention and Control. It has also been used by other health authorities in Europe, as well as organisations in Argentina, Brazil, Colombia and New Zealand.

“The impact is huge, and we want data tools and ways of bringing high-quality information together to inform policy and action to be scaled,” says Aanensen. “Freely available software and an open data ethos is something we hold close to our hearts.”

As well as supporting its existing applications, the centre has created and adapted software during the pandemic. This includes a system that enables Cog-UK’s sequencing sites to upload spreadsheet-format metadata on samples to Climb-Covid using a drag-and-drop interface, and that checks the metadata is valid.
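The kind of validity check such an upload system might run can be sketched briefly. This is an illustration only, not the centre's software: the column names (sample_id, collection_date, region) and the rules applied are assumptions made for the example, and the article does not describe the real schema.

```python
import csv
import io
from datetime import date

# Hypothetical column names, chosen for illustration only.
REQUIRED_COLUMNS = ("sample_id", "collection_date", "region")


def validate_metadata(csv_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    errors = []
    # Row 1 is the header, so data rows are numbered from 2, as in a spreadsheet.
    for row_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                errors.append(f"row {row_number}: empty {column}")
        collected = (row["collection_date"] or "").strip()
        if collected:
            try:
                date.fromisoformat(collected)
            except ValueError:
                errors.append(
                    f"row {row_number}: collection_date is not a valid ISO date"
                )
    return errors
```

Returning every problem at once, rather than stopping at the first, suits a drag-and-drop workflow in which a lab would want to correct a whole sheet in one pass before uploading it again.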

It also produced a web wrapper for Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages), software that assigns Sars-Cov-2 genomes to lineages and is developed by a team led by Andrew Rambaut, professor of molecular evolution at the University of Edinburgh. This makes Pangolin easier to access, allowing it to process hundreds of thousands of samples and enabling users to view the global distribution of specific lineages, such as the B117 variant.

This meant increasing the capacity of computational and visual algorithms to cope with the volume of data collected through Cog-UK. For example, the tree viewer used to visualise relationships between genomes was moved from Canvas to WebGL, with an algorithm to reduce detail from a large number of samples. “Now we can display trees of several million, even though we’re not there yet,” says Aanensen.
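The article does not say how the viewer's detail-reduction algorithm works. One common approach, shown here purely as an assumption, is to collapse every subtree below a chosen depth into a single summary node that records how many samples it stands for. The Node class and the collapse function are hypothetical names for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node; a node with no children represents one sequenced sample."""
    name: str
    children: list["Node"] = field(default_factory=list)


def count_leaves(node: Node) -> int:
    """Count the samples (leaf nodes) beneath and including this node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def collapse(node: Node, max_depth: int, depth: int = 0) -> Node:
    """Return a copy of the tree with subtrees at max_depth or deeper merged.

    Each merged subtree becomes one leaf labelled with its sample count,
    so the number of nodes to draw shrinks while the totals stay visible.
    """
    if not node.children:
        return Node(node.name)
    if depth >= max_depth:
        return Node(f"{node.name} ({count_leaves(node)} samples)")
    return Node(
        node.name,
        [collapse(child, max_depth, depth + 1) for child in node.children],
    )
```

A viewer built this way could redraw with a larger max_depth as the user zooms in, so that the full detail is only rendered for the part of the tree on screen.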

This work fits the centre’s aim of avoiding narrowly defined software, with most of the focus on existing products. “Lots of processes have been accelerated,” says Aanensen of its work during the pandemic. This was primarily achieved through everyone doing more: “Essentially, we just doubled our workload.”

Aanensen says that having a number of sequencing labs joined up with computing has been a key strength of Cog-UK, an approach he sums up as “decentralised sequencing with centralised analysis”. He adds: “You have to deliver value at local sites, but contextualise local data in the broader picture.”

It has been refreshing to work with organisations across the UK, all fired up quickly and focused on delivery, he says.

Although Cog-UK’s work on the pandemic is not yet completed, those involved are excited about how future projects can build on it to go further. “This could be applied to any pathogen you care to look at,” says Thomas Connor at Cardiff University.

Samples of tuberculosis and gastro pathogens are already sequenced but rarely shared, and there is potential to sequence other infectious diseases, he says. “The value of sharing this kind of data fast has been demonstrated. That’s a really important legacy.”
