QuantumBlack finds data pipeline solace in Kedro

Data analytics company QuantumBlack is this month celebrating the launch of Kedro, its first open source project for data scientists.

But what is Kedro?

Kedro is a development workflow framework that structures a programmer’s data pipeline and provides a standardised approach to collaboration for teams building deployable, reproducible, portable and versioned data pipelines.

QuantumBlack was acquired by McKinsey & Company in 2015, and Kedro is the first publicly available open source project the management consultancy has created.

Michele Battelli, global head of engineering & product for QuantumBlack at McKinsey, asserts that many data scientists must perform the routine tasks of data cleaning, processing and compilation; these may not be their favourite activities, but they still make up a large share of their day-to-day work.

He claims Kedro makes it easier to build a data pipeline to automate the ‘heavy lifting’ and reduce the amount of time spent on this kind of task.

In terms of use, Kedro allows developers to:

  • Structure analytics code in a uniform way so that it flows seamlessly through all stages of a project
  • Deliver code that is ‘production-ready’, making it [theoretically] easier to integrate into a business process
  • Build data pipelines that are modular, tested, reproducible in different environments and versioned, allowing users to access previous data states

QuantumBlack says it has used Kedro on more than 60 projects.

“Every data scientist follows their own workflow when solving analytics problems. When working in teams, a common ground needs to be agreed for efficient collaboration. However, distractions and shifting deadlines may introduce friction, ultimately resulting in incoherent processes and bad code quality. This can be alleviated by adopting an unbiased standard which captures industry best practices and conventions,” noted Battelli and team, in a press statement.

The Kedro team state that production-ready code should have the following attributes — it should be:

  • Reproducible in order to be trusted
  • Modular in order to be maintainable and extensible
  • Monitored to make it easy to identify errors
  • Tested to prevent failure in a production environment
  • Well documented and easy to operate by non-experts

Battelli thinks that code written during a pilot phase rarely meets these specifications and can sometimes require weeks of re-engineering work before it can be used in a production environment.

Kedro also features data abstraction, to enable developers to manage how an application will load and save data — this is so they don’t have to worry about the reproducibility of the code itself in different environments.
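The idea behind this kind of data abstraction can be sketched in a few lines of plain Python. The `DataCatalog` class and the `raw_sales` dataset below are illustrative assumptions, not Kedro's actual API: the point is that pipeline logic asks the catalog for data by name, so swapping a local file for a cloud store only changes configuration, not code.

```python
# Illustrative sketch (not Kedro's real API): a minimal "data catalog"
# that decouples pipeline logic from where and how data is stored.

class DataCatalog:
    """Maps dataset names to load/save callables."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, load, save):
        self._datasets[name] = (load, save)

    def load(self, name):
        return self._datasets[name][0]()

    def save(self, name, data):
        self._datasets[name][1](data)


# In-memory backend for a local run; a production configuration could
# register the same dataset names against S3 paths or database tables,
# leaving the pipeline code untouched.
store = {}
catalog = DataCatalog()
catalog.register(
    "raw_sales",
    load=lambda: store["raw_sales"],
    save=lambda data: store.update(raw_sales=data),
)

catalog.save("raw_sales", [100, 250, 80])
print(catalog.load("raw_sales"))  # [100, 250, 80]
```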

Kedro also features modularity, allowing developers to break large chunks of code into smaller self-contained and understandable logical units. There’s also ‘seamless’ packaging, allowing coders to ship projects to production, e.g. using Docker or Airflow.
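The modularity principle can be illustrated with a hypothetical sketch (again, not Kedro's actual API): small, self-contained functions are composed into a pipeline, so each unit can be understood and tested on its own rather than buried inside one monolithic script.

```python
# Hypothetical sketch of the modularity idea: small, self-contained
# "node" functions chained into a pipeline.

def clean(records):
    """Drop records with missing values."""
    return [r for r in records if r is not None]

def total(records):
    """Aggregate the cleaned records."""
    return sum(records)

def run_pipeline(data, steps):
    """Apply each step in order; each step is testable in isolation."""
    for step in steps:
        data = step(data)
    return data

result = run_pipeline([100, None, 250, 80], [clean, total])
print(result)  # 430
```

Because each node takes plain inputs and returns plain outputs, reordering, replacing or unit-testing a step never requires touching the rest of the pipeline.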

The team invites open contributions to the project and says that it is excited to see how it develops in the future — Kedro is here on GitHub.
