
Welcome to the FfDL wiki!

Short Name

Create and Deploy a Deep Learning Platform on Kubernetes

Short Description

Deploy a Deep Learning Platform on Kubernetes, offering TensorFlow, Caffe, PyTorch, etc. as a Service.

Offering Type

Cognitive

Introduction

This code provides a fabric for scalable deep learning on Kubernetes, enabling users to leverage deep learning libraries such as Caffe, Torch, and TensorFlow in the cloud in a scalable and resilient manner with minimal effort. The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes. A resource provisioning layer enables flexible job management on heterogeneous resources, such as graphics processing units (GPUs) and central processing units (CPUs), in an infrastructure as a service (IaaS) cloud.
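As an illustration of how a training job is described to the platform, the snippet below renders a manifest-style job specification with a framework choice and resource requests. The field names shown are assumptions that mirror the general shape of an FfDL training manifest, not its exact schema; see the repository for the authoritative format.

```python
# Illustrative sketch of the kind of job specification a user hands to the platform.
# The field names (framework, cpus, gpus, memory) are assumptions for illustration
# only; the real FfDL manifest schema is documented in the repository.
import yaml  # PyYAML, used here only to render the spec as a manifest-style document

job_spec = {
    "name": "cnn-training-example",
    "version": "1.0",
    # Resource requests let the provisioning layer place the job on CPUs or GPUs.
    "cpus": 4,
    "gpus": 1,
    "memory": "8Gb",
    # The framework section tells the orchestration layer which library to run.
    "framework": {
        "name": "tensorflow",
        "version": "1.5.0",
        "command": "python3 train.py",
    },
}

print(yaml.safe_dump(job_spec, default_flow_style=False))
```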

Author

By Animesh Singh, Scott Boag, Tommy Li, Waldemar Hummer

Code

https://github.com/IBM/FfDL

Demo

N/A

Video

Overview

Training deep neural networks, known as deep learning, is currently highly complex and computationally intensive. A typical user of deep learning is unnecessarily exposed to the details of the underlying hardware and software infrastructure, including configuring expensive GPU machines, installing deep learning libraries, and managing the jobs during execution to handle failures and recovery. Despite the ease of obtaining hardware from infrastructure as a service (IaaS) clouds and paying by the hour, the user still needs to manage those machines, install required libraries, and ensure resiliency of the deep learning training jobs. Furthermore, the user must implement highly complex techniques for scaling and resiliency on their own, as well as keep pace with the updates to the deep learning frameworks in the open source communities.

Instead of being mired in infrastructure and cluster management problems, users would like to focus on training a model in the easiest way possible, one that satisfies both their cost and performance objectives. This is where the opportunity of deep learning as a service lies. It combines the flexibility, ease of use, and economics of a cloud service with the power of deep learning: it is easy to use via REST APIs; training can use different amounts of resources per user requirements or budget; it is resilient (handles failures); and it frees users to spend their time on deep learning and its applications. Users choose from a set of supported deep learning frameworks, a neural network model, training data, and cost constraints, and the service takes care of the rest, providing an interactive, iterative training experience. The job gets scheduled and executed on a pool of heterogeneous infrastructure, including GPUs and CPUs. A simple API (application programming interface) shields users from the complexity of the infrastructure and the advanced mechanics of scaling through distribution. Users can see the progress of their training job and terminate it or modify its parameters based on how it is progressing. When it is done, the trained model is ready to be deployed in the cloud to classify new data.
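To make that REST-driven workflow concrete, here is a minimal Python sketch of how a client might submit a training job and poll its status. The endpoint path, port, form-field names, response fields, and credentials are assumptions for illustration; the actual FfDL API contract is defined in the repository.

```python
# Sketch of submitting a training job to a DLaaS/FfDL-style REST API.
# Endpoint, field names, and credentials below are illustrative assumptions --
# consult the FfDL API documentation in the repository for the real contract.
import requests

API_URL = "http://localhost:30005/v1/models"   # hypothetical gateway address
AUTH = ("test-user", "test-password")          # hypothetical basic-auth credentials

# A manifest describes the framework, resources, and data stores for the job;
# the model definition is a zip archive of the user's training code.
with open("manifest.yml", "rb") as manifest, open("model.zip", "rb") as model_zip:
    resp = requests.post(
        API_URL,
        files={"manifest": manifest, "model_definition": model_zip},
        auth=AUTH,
    )
resp.raise_for_status()
model_id = resp.json().get("model_id")
print("Submitted training job:", model_id)

# Poll the job so the user can watch progress, stop it, or adjust parameters.
status = requests.get(f"{API_URL}/{model_id}/training_status", auth=AUTH).json()
print("Current status:", status)
```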

The value of DLaaS is not limited to data scientists; it extends to developers of new applications and services who would like to add deep learning capabilities but are unable or unwilling to build their own software stacks, buy dedicated hardware, or handle scaling and resiliency in-house. Some prominent examples of deep learning within applications and services are speech recognition [4], visual recognition [5], natural language understanding and classification [6], and language translation [7].

FfDL makes it easy for a provider of such consumer-facing cognitive services to offer deep learning training to its users, or to use it to customize models and deliver better outcomes for its customers.

Flow

  1. Inspect the available attributes in the Google BigQuery database for the Met art collection
  2. Create the labeled dataset using the attribute selected
  3. Select a model for image classification from the set of available public models and deploy to IBM Cloud
  4. Run the training on Kubernetes, optionally using GPUs if available (a minimal training-script sketch follows this list)
  5. Save the trained model and logs
  6. Visualize the training with TensorBoard
  7. Load the trained model in Kubernetes and run an inference on a new art drawing to see the classification
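The sketch below illustrates what the training script behind steps 4 through 6 might look like: a small TensorFlow/Keras model that writes TensorBoard summaries and the trained model to a results directory. The DATA_DIR and RESULT_DIR environment variables and the toy MNIST model are assumptions for illustration; the actual scripts for this pattern live in the code repository.

```python
# Minimal sketch of a training script that the platform could run as a job.
# The DATA_DIR/RESULT_DIR environment variables and the tiny model below are
# illustrative assumptions, not the exact convention used by this pattern.
import os
import tensorflow as tf

data_dir = os.environ.get("DATA_DIR", "./data")        # a real job reads its dataset from here
result_dir = os.environ.get("RESULT_DIR", "./results") # trained models and logs are collected here
os.makedirs(result_dir, exist_ok=True)

# A toy in-memory dataset stands in for data that would normally live under data_dir.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# TensorBoard summaries written under the results directory (step 6).
tb_logs = os.path.join(result_dir, "logs")
callbacks = [tf.keras.callbacks.TensorBoard(log_dir=tb_logs)]

model.fit(x_train, y_train, epochs=2, batch_size=128, callbacks=callbacks)

# Persist the trained model so it can be loaded later for inference (steps 5 and 7).
model.save(os.path.join(result_dir, "model.h5"))
```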

Included components

Featured technologies

Blog

Visualizing High Dimension data for Deep Learning

Links
