Price comparison site orchestrates data pipeline with GKE

The price comparison site has used container orchestration to split its machine learning data pipeline into parallel processes

By using Google Kubernetes Engine (GKE), the price comparison site has been able to parallelise its data pipeline. This is part of a wider deployment of analytics services on Google’s public cloud. The site recently moved to Google Cloud Platform (GCP), enabling it to take advantage of the analytics services built into the platform.

“GCP is being used as our analytics cloud platform,” says Harvinder Atwal, the company’s head of analytics. “We did a proof of concept with Google. It has invested a lot in analytics services, which means there are many managed services on GCP, so our team of data scientists has a lot less to worry about.” He adds that GCP offers an easier-to-maintain analytics platform.

For its enterprise data warehouse, the company takes large volumes of data from its website, which flows into Google’s BigQuery. It uses GKE to orchestrate a process through containerised applications that clean the data and load it into BigQuery.
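The cleaning step could look something like the following minimal sketch. The field names and the `clean_rows` helper are illustrative assumptions, not the company’s actual code, and the BigQuery load itself, which would use Google’s official client library, is shown only as a comment:

```python
def clean_rows(rows):
    """Drop incomplete records and normalise types before loading to BigQuery.

    `rows` is a list of dicts taken from website event logs
    (field names here are illustrative, not the company's schema).
    """
    cleaned = []
    for row in rows:
        if not row.get("user_id") or row.get("price") is None:
            continue  # skip incomplete records
        cleaned.append({
            "user_id": str(row["user_id"]).strip(),
            "product": str(row.get("product", "unknown")).lower(),
            "price": float(row["price"]),
        })
    return cleaned

if __name__ == "__main__":
    raw = [
        {"user_id": " 42 ", "product": "Car-Insurance", "price": "199.99"},
        {"user_id": None, "product": "loans", "price": 10.0},  # dropped
    ]
    print(clean_rows(raw))
    # The load into BigQuery would then use the official client, e.g.:
    # from google.cloud import bigquery
    # bigquery.Client().insert_rows_json("dataset.events", clean_rows(raw))
```

Running the cleaning logic in its own container means the load step can be retried or rescheduled independently of the rest of the pipeline.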

“BigQuery is very fast and scalable,” says Atwal. “We don’t need to worry about sizing queries, and Google takes care of scaling. BigQuery becomes a main point of truth, and it also becomes the integration point for other data, enabling the data science team to integrate third-party data.”

Using BigQuery also helps speed up the process of extracting, transforming and loading (ETL) data into its enterprise data warehouse. “It takes a lot of work to take raw data and ingest it into a data warehouse,” says Atwal.

“The ETL pipeline can be quite brittle. Rather than wait for the ETL developers to create a data pipeline, we now ingest direct into GCP.”

He says the company has also created a training and model-scoring pipeline using GKE and containers to break model training into individual tasks.

Machine learning data pipeline

The various steps in the machine learning data pipeline involve data quality checks, preprocessing for normalisation and standardisation, feature extraction to identify new classes of data, model training, and assessments of model accuracy based on a test data set.
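The normalisation and standardisation step mentioned above can be sketched as a simple z-score transform. This is a generic illustration of the technique, not the company’s code:

```python
import math

def standardise(column):
    """Z-score standardisation: rescale values to zero mean, unit variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance) or 1.0  # guard against zero-variance columns
    return [(x - mean) / std for x in column]

print(standardise([1.0, 2.0, 3.0]))
```

Putting features on a common scale like this stops large-valued columns from dominating model training.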

The flexibility of GKE allows the company to use it for several projects, including machine learning (ML) and web-facing application programming interfaces (APIs) – using Python and mostly XGBoost as the ML classifier in the container application code.
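A container image for such a pipeline typically exposes a small entrypoint that runs one stage per invocation. The sketch below is a hedged illustration of that pattern: the stage names and structure are assumptions, and the actual training call, which the article says uses XGBoost, appears only as a comment:

```python
import sys

# Each pipeline stage is registered as a function; the container runs one
# stage per invocation (stage names here are illustrative assumptions).
STAGES = {
    "extract": lambda: "rows extracted",
    "features": lambda: "features engineered",
    "train": lambda: "model trained",     # in practice this stage would call
                                          # e.g. xgboost.XGBClassifier().fit(...)
    "evaluate": lambda: "model evaluated",
}

def run_stage(name):
    """Run a single named pipeline stage and return its status."""
    return STAGES[name]()

if __name__ == "__main__":
    # The orchestrator passes the stage name as a command-line argument.
    stage = sys.argv[1] if len(sys.argv) > 1 else "train"
    print(run_stage(stage))
```

Because every stage is addressable by name, the orchestrator can schedule each one as a separate container rather than running the whole pipeline as a single process.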

Containerised data pipeline for ML

The company uses ML to serve its personalised customer recommendations, and GKE forms the backbone of the ML model training and inference pipelines.

Each task in the model training pipeline – data extraction, feature engineering, model training and model evaluation – runs as a containerised application in GKE and is orchestrated with Cloud Composer. By using GKE to orchestrate containers, the data pipeline for ML can be parallelised.
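GKE does the fan-out across containers, but the effect of parallelising independent training tasks can be illustrated in-process with Python’s standard library. The `train_model` stand-in is an assumption for illustration, not the company’s code:

```python
from concurrent.futures import ThreadPoolExecutor

def train_model(model_id):
    """Stand-in for one containerised training task (illustrative only)."""
    return f"model-{model_id}: trained"

# Independent per-model training tasks run side by side, just as GKE
# schedules one container per task instead of one sequential VM job.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_model, range(4)))

print(results)
```

In the real pipeline the same fan-out happens at the container level, with Cloud Composer tracking which tasks have completed before triggering downstream stages.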

Using containerised applications for each pipeline task allows data scientists to make frequent incremental improvements through continuous integration and continuous deployment (CI/CD) practices.

“We can try out several types of algorithms and handle different sizes of data sets, and update any stage of the pipeline without impacting other parts [of the data pipeline],” he says. The company originally considered using virtual machines (VMs) instead of containers to handle the ML data pipeline, but this would not have scaled well. “The downside we found is that we would have needed to run the whole end-to-end process on one VM,” says Atwal. “To scale up for many models and multiple customers, we would have needed a very large VM.”

Using a VM would also have meant the data models being processed sequentially – rather than in parallel, which is what the company has achieved using containerisation.

Read more about advanced analytics

  • A look at DataOps, agile analytics and an IT leader who is striving to make Boston a data-driven city: The Data Mill reports.
  • Better data governance, increased cloud use and wider DataOps adoption head the list of trends for data management teams to plan for in 2019, IT analysts say.

As Computer Weekly has previously reported, the company migrated from a legacy SAS analytics platform to GCP, using Google’s serverless software components, including BigQuery, Kubernetes, Dataflow and TensorFlow.

Moving to Google has enabled the company to simplify its data architecture. Based on Google’s reference architecture, it has been able to deploy serverless and software-as-a-service technology, which meant there was no infrastructure to manage, enabling the data science teams to concentrate on getting their work done on GCP, says Atwal.

With its new analytics platform, the company has benefited most from the speed of development and of running big tasks. In Atwal’s own words, the most notable change has been the deployment time for its machine learning pipelines.

“We went from eleven hours down to about five minutes,” he says. That meant the models could be updated every day instead of once a week, which, in turn, led to more relevant communications and offers, ultimately helping customers to save more money.
