MoneySupermarket.com orchestrates data pipeline with GKE

The price comparison site has used container orchestration to split its machine learning data pipeline into parallel processes

Cliff Saran, Managing Editor

Published: 21 May 2019 12:45

By using Google Kubernetes Engine (GKE), price comparison site MoneySupermarket.com has been able to parallelise its data pipeline. This is part of a wider deployment of analytics services on Google’s public cloud. The company recently moved to Google Cloud Platform (GCP), which has enabled it to take advantage of the analytics services built into the platform.

“GCP is being used as our analytics cloud platform,” says Harvinder Atwal, head of analytics at MoneySupermarket.com. “We did a proof of concept with Google. It has invested a lot on analytics services, which means there are many managed services on GCP, so our team of data scientists have a lot less to worry about.” He adds that GCP offers MoneySupermarket.com an easier-to-maintain analytics platform.

For its enterprise data warehouse, MoneySupermarket.com collects large volumes of data from its website and stores it in Google’s BigQuery. It uses GKE to orchestrate containerised applications that clean the data and load it into BigQuery.
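As a rough illustration of the kind of cleaning step such a container might run, the sketch below takes raw website events, drops malformed rows and serialises the rest as newline-delimited JSON, one of the formats BigQuery load jobs accept. The field names (`user_id`, `product`, `quote_price`) and the helpers `clean_events` and `to_ndjson` are hypothetical and are not taken from MoneySupermarket.com's pipeline. The load into BigQuery itself is left out.

```python
import json

# Hypothetical raw click events; field names are illustrative only.
RAW_EVENTS = [
    {"user_id": " u1 ", "product": "Car Insurance", "quote_price": "312.50"},
    {"user_id": "", "product": "energy", "quote_price": "80"},      # missing user
    {"user_id": "u2", "product": "Energy", "quote_price": "n/a"},   # bad price
    {"user_id": "u3", "product": " broadband", "quote_price": "24"},
]


def clean_events(events):
    """Return only well-formed events, with normalised field values."""
    cleaned = []
    for event in events:
        user_id = event.get("user_id", "").strip()
        if not user_id:
            continue  # drop rows that cannot be tied to a user
        try:
            price = float(event.get("quote_price", ""))
        except (TypeError, ValueError):
            continue  # drop rows whose price is not numeric
        cleaned.append({
            "user_id": user_id,
            "product": event.get("product", "").strip().lower(),
            "quote_price": price,
        })
    return cleaned


def to_ndjson(rows):
    """Serialise rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


if __name__ == "__main__":
    print(to_ndjson(clean_events(RAW_EVENTS)))
```

Keeping the cleaning logic in a small, pure function like this is one way to make a container easy to test before it is deployed.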

“BigQuery is very fast and scalable,” says Atwal. “We don’t need to worry about fixed-size queries and Google takes care of scaling. BigQuery becomes a main point of truth, and it also becomes the integration point for other data, enabling the data science team at MoneySupermarket.com to integrate third-party data.”

Using BigQuery also helps MoneySupermarket.com speed up the process of extracting, transforming and loading (ETL) data into its enterprise data warehouse. “It takes a lot of work taking raw data to ingest into a data warehouse,” says Atwal.

“The ETL pipeline can be quite brittle. Rather than wait for the ETL developers to create a data pipeline, we now ingest direct into GCP.”

He says MoneySupermarket.com has also created a training and model scoring pipeline using GKE and containers to break up model training into individual tasks.

Machine learning data pipeline

The steps in the machine learning data pipeline are data quality checks, preprocessing for normalisation and standardisation, feature extraction to identify new classes of data, model training, and assessment of model accuracy against a test data set.
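To make those stages concrete, here is a minimal sketch in plain Python of a quality check, standardisation, training and evaluation against held-out test rows. It is an illustration only: the data is invented, a simple threshold rule stands in for a real classifier such as XGBoost, feature extraction is skipped, and every function name is hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical (feature, label) rows; the values are made up for illustration.
ROWS = [(12.0, 0), (15.0, 0), (14.0, 0), (30.0, 1), (33.0, 1), (29.0, 1),
        (13.0, 0), (31.0, 1)]


def check_quality(rows):
    """Data quality: keep rows with a float feature and a 0/1 label."""
    return [(x, y) for x, y in rows if isinstance(x, float) and y in (0, 1)]


def standardise(values, mu, sigma):
    """Preprocessing: rescale to zero mean and unit variance."""
    return [(v - mu) / sigma for v in values]


def train(features, labels):
    """Model training: threshold halfway between the two class means."""
    pos = mean(f for f, y in zip(features, labels) if y == 1)
    neg = mean(f for f, y in zip(features, labels) if y == 0)
    return (pos + neg) / 2


def accuracy(threshold, features, labels):
    """Evaluation: share of test rows classified correctly."""
    hits = sum((f > threshold) == bool(y) for f, y in zip(features, labels))
    return hits / len(labels)


def run_pipeline(rows, n_train=6):
    """Run each stage in order; the last rows are held out as the test set."""
    rows = check_quality(rows)
    train_rows, test_rows = rows[:n_train], rows[n_train:]
    xs = [x for x, _ in train_rows]
    mu, sigma = mean(xs), pstdev(xs)  # statistics come from training data only
    threshold = train(standardise(xs, mu, sigma), [y for _, y in train_rows])
    test_x = standardise([x for x, _ in test_rows], mu, sigma)
    return accuracy(threshold, test_x, [y for _, y in test_rows])


if __name__ == "__main__":
    print(run_pipeline(ROWS))
```

The scaling statistics are computed on the training rows and reused for the test rows, so the accuracy figure is not flattered by information from the test set.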

The flexibility of GKE allows MoneySupermarket.com to use it for several projects, including machine learning (ML) and web-facing application programming interfaces (APIs). The container application code is written in Python and mostly uses XGBoost as the ML classifier.

Containerised data pipeline for ML

MoneySupermarket.com uses ML to serve its personalised customer recommendations, and GKE forms the backbone of the ML model training and inference pipelines.

Each task in the model training pipeline – data extraction, feature engineering, model training and model evaluation – runs as a containerised application in GKE, and is orchestrated with Cloud Composer. By using GKE to orchestrate containers, the data pipeline for ML can be parallelised.
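Cloud Composer is Google's managed Apache Airflow service, which schedules tasks according to a dependency graph. The sketch below imitates that idea using only the Python standard library: the four stages run in order for each model, while the same stage for different models runs concurrently. It does not call Airflow or GKE; the model names and the `run_task` stand-in for launching a container are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

STAGES = ["extract", "features", "train", "evaluate"]
MODELS = ["model_a", "model_b", "model_c"]  # hypothetical model names


def build_graph(models):
    """Map each task to the set of tasks it depends on."""
    graph = {}
    for model in models:
        previous = None
        for stage in STAGES:
            task = f"{model}:{stage}"
            graph[task] = {previous} if previous else set()
            previous = task
    return graph


def run_task(task):
    """Stand-in for launching one container; returns the task name."""
    return task


def run_pipeline(models, max_workers=3):
    """Run tasks in dependency order, running ready tasks concurrently."""
    sorter = TopologicalSorter(build_graph(models))
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()           # tasks whose inputs are done
            done = list(pool.map(run_task, ready))
            sorter.done(*done)                   # unlock dependent tasks
            waves.append(sorted(done))
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(run_pipeline(MODELS), start=1):
        print(number, wave)
```

Because each stage is a separate task, one stage can be swapped or rerun for a single model without touching the others.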

Using containerised applications for each pipeline task allows data scientists to make frequent incremental improvements through continuous integration and continuous deployment (CI/CD) practices.

“We can try out several types of algorithms and handle different sizes of data sets, and update any stage of [the] pipeline without impacting other parts [of the data pipeline],” says Atwal.

MoneySupermarket.com did originally consider using virtual machines (VMs) instead of containers to handle the ML data pipeline, but this would not have scaled well. “The downside we found is that we would have needed to run [the] whole end-to-end process on one VM,” says Atwal. “To scale up for many models and multiple customers, we would have needed a very large VM.”

Using a VM would also have meant the data model had to be processed sequentially. Containerisation has allowed MoneySupermarket.com to process it in parallel instead.
