Yahoo Big ML (machine learning) team releases TensorFlowOnSpark

Yahoo!’s Big ML (machine learning) team, comprising Lee Yang, Jun Shi, Bobbie Chern and Andy Feng, has confirmed that it is offering TensorFlowOnSpark to the community. This is the team’s latest open source framework for distributed deep learning on big-data clusters.

The team says it has found that gaining insight from massive amounts of data requires distributed deep learning. But (and here comes the reason for the new release) it also says that existing deep learning (DL) frameworks often require separate clusters to be set up for deep learning, forcing the team to create multiple programs for a single machine learning pipeline.

Having separate clusters requires the team to transfer large datasets between them, which it says introduces unwanted system complexity and end-to-end learning latency.

“Last year we addressed scaleout issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters,” the team confirms.

The team says it uses CaffeOnSpark to improve NSFW image detection and to automatically identify eSports game highlights from live-streamed video.

With the community’s feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on Docker containers.

“This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe,” they say.

After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016.

In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale.

According to the team, “To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. Databricks proposed TensorFrame to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.”

The new framework, TensorFlowOnSpark (TFoS), is meant to enable distributed TensorFlow execution on Spark and Hadoop clusters. 

TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters.
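
To make the distinction between synchronous and asynchronous data-parallel training concrete, here is a minimal sketch in plain Python. It is a toy illustration only: it does not use TensorFlow, Spark or the TensorFlowOnSpark API, and the function names, the tiny dataset and the learning rate are all assumptions made for this example. The "workers" are simulated one after another in a single process, so the asynchronous variant shows only the update pattern, not real concurrency or stale gradients.

```python
# Toy illustration only: plain Python, no TensorFlow or Spark involved.
# All names and values here are hypothetical, chosen for this example.

def gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_sync(shards, w=0.0, lr=0.05, steps=200):
    """Synchronous style: every worker reads the same w, gradients are averaged,
    and a single update is applied per step."""
    for _ in range(steps):
        grads = [gradient(w, shard) for shard in shards]
        w -= lr * sum(grads) / len(grads)
    return w

def train_async(shards, w=0.0, lr=0.05, steps=200):
    """Asynchronous style (simulated): each worker's gradient is applied to the
    shared parameter as soon as it is computed, without waiting for the others."""
    for _ in range(steps):
        for shard in shards:
            w -= lr * gradient(w, shard)
    return w

# Data generated from y = 3x, split into two shards (one per simulated worker).
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w_sync = train_sync(shards)
w_async = train_async(shards)
print(w_sync, w_async)
```

In this toy setting both variants recover a weight close to 3. In a real cluster, asynchronous workers may compute gradients against parameters that other workers have since updated, which is the trade-off that makes supporting both modes useful.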

The team also says that any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, fewer than 10 lines of Python code need to be changed. Many developers at Yahoo who use TensorFlow have easily migrated TensorFlow programs for execution with TensorFlowOnSpark.