Databricks announced MLflow, a new open source project for managing the machine learning lifecycle, at the Spark Summit this month.
The company focuses on cloud-based big data processing using the open source Apache Spark cluster computing framework.
The company’s chief technologist Matei Zaharia says that the team built its machine learning (ML) approach to address the problems that people typically voice when it comes to ML.
Typical ML challenges
The first challenge is a myriad of tools, spread across each ‘phase’ of the ML development lifecycle, from data preparation to model training.
“Unlike traditional software development, where teams select one tool for each phase, in ML you usually want to try every available tool (e.g. algorithm) to see whether it improves results. ML developers thus need to use and productionize dozens of libraries,” noted Zaharia in a blog post.
He also notes that because ML algorithms have dozens of configurable parameters, it is difficult to track which parameters, code and data went into each experiment to produce a model.
Zaharia explains that without detailed tracking, teams often have trouble getting the same code to work again. Being unable to reproduce earlier steps makes debugging tough too.
“[It’s also] hard to deploy ML. Moving a model to production can be challenging due to the plethora of deployment tools and environments it needs to run in (e.g. REST serving, batch inference, or mobile apps). There is no standard way to move models from any library to any of these tools, creating a new risk with each new deployment,” said Zaharia.
What we have ended up with is big vendors producing internal ML platforms that do part of the job, but are limited in scope because they are tied to each company’s own technology infrastructure.
MLflow, by contrast, is built with an open interface and is designed to work with any ML library, algorithm, deployment tool or language.
It’s also built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality.
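To make the ‘model as a lambda function’ idea concrete, here is a minimal Python sketch. It does not use MLflow itself. The names `Model`, `batch_predict`, `linear_model` and `threshold_model` are hypothetical and exist only for illustration. The point it tries to show is that a tool which only expects a callable can serve models regardless of which library produced them.

```python
from typing import Callable, List, Sequence

# Assumption for illustration: a "model" is any callable that maps one
# row of numeric features to a single numeric prediction.
Model = Callable[[Sequence[float]], float]


def batch_predict(model: Model, rows: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical deployment tool: it only needs the model to be callable."""
    return [model(row) for row in rows]


# A model written as a literal lambda function.
linear_model: Model = lambda row: 2.0 * row[0] + 1.0


# A model written as an ordinary function; it could just as well wrap
# a model trained with any ML library.
def threshold_model(row: Sequence[float]) -> float:
    return 1.0 if sum(row) > 3.0 else 0.0


if __name__ == "__main__":
    print(batch_predict(linear_model, [[1.0], [2.0]]))
    print(batch_predict(threshold_model, [[1.0, 1.0], [2.0, 2.0]]))
```

Because `batch_predict` never inspects how the model was built, the same batch inference code serves both models unchanged.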
“We’re releasing MLflow as an open source project that users and library developers can extend. In addition, MLflow’s open format makes it very easy to share workflow steps and models across organisations if you wish to open source your code,” said Zaharia.
Developers can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs.
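The tracking workflow described here (log the parameters and results of each run to local files, then compare runs) can be sketched with nothing but the Python standard library. This is an illustrative stand-in, not MLflow's actual API. The function names `log_run`, `load_runs` and `best_run`, the JSON file layout and the example parameter and metric names are all assumptions made for this sketch.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path
from typing import Dict, List


def log_run(root: Path, params: Dict[str, float], metrics: Dict[str, float]) -> Path:
    """Write one experiment run (its parameters and metrics) to a JSON file."""
    run_id = uuid.uuid4().hex
    record = {
        "run_id": run_id,
        "logged_at": time.time(),
        "params": params,
        "metrics": metrics,
    }
    run_file = root / f"{run_id}.json"
    run_file.write_text(json.dumps(record, indent=2))
    return run_file


def load_runs(root: Path) -> List[dict]:
    """Read back every run that was logged under the given directory."""
    return [json.loads(path.read_text()) for path in sorted(root.glob("*.json"))]


def best_run(root: Path, metric: str, higher_is_better: bool = True) -> dict:
    """Compare all runs that recorded the metric and return the best one."""
    runs = [run for run in load_runs(root) if metric in run["metrics"]]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])


if __name__ == "__main__":
    # Hypothetical experiment: two runs with different regularisation strengths.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        log_run(root, {"alpha": 0.1}, {"rmse": 0.92})
        log_run(root, {"alpha": 0.5}, {"rmse": 0.78})
        winner = best_run(root, "rmse", higher_is_better=False)
        print(winner["params"], winner["metrics"])
```

Storing each run as its own self-describing file is what makes later comparison possible: the parameters travel with the results rather than living in someone's memory or shell history.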
Databricks says it is ‘just getting started’ with MLflow, so there is a lot more to come. Apart from updates to the project, the team plans to introduce major new components (e.g. monitoring), library integrations, and extensions such as support for more environment types in the weeks and months to come.