
GlaxoSmithKline R&D creates data platform using Hadoop for the internal sharing of scientific data

GlaxoSmithKline is making better use of the data it has about the development and trials of medicines through a Hadoop-based platform

Pharmaceuticals firm GlaxoSmithKline (GSK) has improved its research and development (R&D) capabilities through a programme to enable the sharing of data generated through the development of medicines across the R&D department.

In 2015, GSK embarked on a data strategy to address the challenge it faced in sharing data. There are around 10,000 scientists in GSK’s R&D operation, but very little data on medicine development and trials was shared between them.

Before the data strategy, which is now three years old, all data from medicine trials and experiments was in different formats and stored in different places, said Mark Ramsey, who was brought in as chief data officer for GSK’s R&D operation in 2015.

He said some work had been done on traditional data warehousing in the past, with attempts to structure and organise data using technologies such as Oracle and Teradata. “But what we were really looking for was something to tackle the problem on a broader scale,” said Ramsey.

“Pharmaceutical companies produce a large amount of data, but it is produced in vertical silos,” he said. “For example, in discovery there is experimental data produced which is used to progress individual new medicines, but there wasn’t really an ability to share that information across the R&D organisation and to use the power of the aggregation of that information to make better decisions.”

GSK recognised this as a constraint and recruited Ramsey as chief data officer to define an R&D-wide data strategy, so that information could be used as a strategic asset rather than purely for operations.

He started by identifying where the department was in terms of data use. “I initially did a survey across the entire R&D population of about 10,000 scientists using MIT’s Competing with Data survey, which measures data maturity, and got a very high response rate,” he said.

“In general, the feedback confirmed the hypothesis that people could access the data they created themselves but could not really share it.”

He followed this with an assessment of what had been done in the past, in terms of creating an integrated information platform, and found there had really not been a focused effort in R&D to share data and that the technology required to facilitate sharing was not in place.

When the organisation develops medicines, scientists run experiments. Thousands of scientists carry out experiments, each trying to determine whether a candidate medicine is a success. But at GSK, they were all doing these experiments within individual programmes. “There is value in putting all those experiments together,” said Ramsey.

“Before they start an experiment, they can analyse all the similar experiments already done and get insight from them. The worst-case scenario is somebody doing an experiment that has already been done,” he said.

The organisation also carries out many clinical trials. Each trial is designed to achieve certain focused results, which it either will or will not achieve. “But if you don’t put all the clinical trials together, you lose the value of that aggregated knowledge.”

Bringing information together

The organisation made the decision to use Hadoop as the foundation to give it the ability to bring information from different operational sources together in the right format so it could start curating and rationalising it. Hadoop is an open source software framework for storing and processing both structured and unstructured data.

The company had to start from scratch. “We put in place a new platform because the technology had not been used at GSK before,” said Ramsey.

It then integrated a number of other technologies to bring the data into the platform and rationalise it.
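To illustrate what “rationalising” silo data can involve, the sketch below maps records from two hypothetical silos onto one shared schema. All field names and values are illustrative assumptions, not GSK’s actual data formats or tooling:

```python
from typing import Any

# Hypothetical raw records from two silos: a CSV-style export and a JSON-style
# instrument log. Field names are illustrative, not GSK's actual schemas.
csv_record = {"compound": "GSK-001", "assay": "binding", "result_value": "0.42"}
json_record = {"compoundId": "GSK-002", "assayType": "toxicity", "value": 1.7}

def rationalise(record: dict[str, Any]) -> dict[str, Any]:
    """Map silo-specific field names onto one shared schema."""
    aliases = {
        "compound": "compound_id", "compoundId": "compound_id",
        "assay": "assay_type", "assayType": "assay_type",
        "result_value": "value", "value": "value",
    }
    out = {aliases[k]: v for k, v in record.items() if k in aliases}
    out["value"] = float(out["value"])  # normalise numeric types
    return out

curated = [rationalise(r) for r in (csv_record, json_record)]
print(curated[0])  # {'compound_id': 'GSK-001', 'assay_type': 'binding', 'value': 0.42}
```

Once records share a schema, experiments from different programmes can be queried and aggregated together, which is the point of the platform.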

He said the project would never really end because the data team is constantly refining things and finding new use cases. Most of the work was completed in-house, at GSK’s global hubs, with none of the traditional systems integrator relationships, but it does work with a number of smaller specialists in areas such as data science and analytics.

To this end, GSK has built an ecosystem of about a dozen smaller software suppliers to support the platform. This includes California-based startup Waterline Data, for example, which provides metadata repository technology. This ensures that once the data is in the platform, GSK can search it and see where information exists and who has used it in the past.
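The idea of a searchable metadata repository can be sketched as a simple catalogue of dataset entries. This is a toy illustration only, not Waterline Data’s actual API; every class, dataset name and path below is hypothetical:

```python
from dataclasses import dataclass, field

# A toy metadata catalogue illustrating a searchable metadata repository.
# This is NOT Waterline Data's API; all names and paths are hypothetical.
@dataclass
class DatasetEntry:
    name: str
    location: str
    tags: set[str]
    past_users: list[str] = field(default_factory=list)

class Catalogue:
    def __init__(self) -> None:
        self.entries: list[DatasetEntry] = []

    def register(self, entry: DatasetEntry) -> None:
        self.entries.append(entry)

    def search(self, tag: str) -> list[DatasetEntry]:
        """Find every dataset carrying a given tag."""
        return [e for e in self.entries if tag in e.tags]

cat = Catalogue()
cat.register(DatasetEntry("trial_2017_asthma", "/lake/trials/2017",
                          {"clinical", "asthma"}, ["a.smith"]))
cat.register(DatasetEntry("assay_binding_q3", "/lake/discovery/q3",
                          {"discovery", "binding"}))

hits = cat.search("asthma")
print(hits[0].location, hits[0].past_users)  # /lake/trials/2017 ['a.smith']
```

A scientist can thus find where a dataset lives and who has used it before starting a new experiment, which is the capability the article attributes to the metadata layer.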

GSK is also looking at using artificial intelligence (AI) in the development of new medicines using supercomputing technology.
