Alex Imbrea edited this page Feb 27, 2021 · 1 revision

Eddy AutoML is an automated machine learning platform for streaming data.

The field of automated machine learning (AutoML) studies tools that try to find the best ML algorithm and parameters for a dataset with minimal user input and ML knowledge. Performing AutoML on streaming data rather than batch data introduces new challenges such as data drift, online algorithm selection and online meta-feature extraction. Eddy AutoML provides an easy way to create, deploy and monitor ML jobs for streaming data through a user interface. The following screenshot shows how the metrics for a job look after deployment.

[Screenshot: Eddy AutoML Job]

The use of streaming data is a steadily growing trend in the design of platform architectures. Patterns such as Pub/Sub and Lambda architectures are now widely adopted and have become easier to deploy. Brokers such as Apache Kafka sit at the core of such architectures and ensure communication between all the (micro-)services involved.

Despite these developments, machine learning (ML) tools still lack compatibility with streaming data architectures and tools. The workaround usually adopted in industry is to store parts of the stream in a database and perform batch (also known as offline) ML on that data. This approach has a series of shortcomings, such as exposure to concept drift over time, a more complicated architecture and harder model management. An alternative is to perform ML in an online fashion, directly using the data stream for training, testing and prediction.
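To make the online approach concrete, here is a minimal sketch of test-then-train (prequential) learning, in which each sample from the stream is first used for prediction and evaluation and only then for training. It is an illustration, not Eddy AutoML's implementation: `OnlineLogisticRegression`, `synthetic_stream` and `prequential_accuracy` are hypothetical names, and the synthetic generator merely stands in for a real data stream such as a Kafka topic.

```python
import math
import random


class OnlineLogisticRegression:
    """Tiny logistic regression trained one sample at a time with SGD (illustrative)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        # Clamp z so math.exp cannot overflow on extreme inputs.
        z = max(-30.0, min(30.0, z))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x, y):
        # One gradient step on the log loss for a single (x, y) pair.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def synthetic_stream(n_samples, seed=0):
    """Hypothetical stand-in for a message stream: yields (features, label) pairs."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = 1 if x[0] + x[1] > 0 else 0
        yield x, y


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each sample first, then learn from it."""
    correct = 0
    seen = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # testing and prediction
        model.learn_one(x, y)                  # training
        seen += 1
    return correct / seen if seen else 0.0


if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=2)
    accuracy = prequential_accuracy(model, synthetic_stream(2000))
    print(f"prequential accuracy: {accuracy:.3f}")
```

Because the model never sees a sample before being scored on it, the running accuracy doubles as a live test metric, and no portion of the stream has to be stored in a database first.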

Automated machine learning (AutoML) techniques aim to automate and improve ML algorithm selection and parameter tuning. However, all state-of-the-art AutoML tools are designed to work on batch data, because they usually rely on statistical properties of the underlying data distribution to select which ML algorithm and parameters to use.
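On a batch dataset such statistical properties can be computed in one pass over all the data; on a stream they have to be maintained incrementally. As a rough illustration of what that could look like, the sketch below tracks the mean and variance of a single numeric feature with Welford's algorithm. The `RunningStats` class is a hypothetical helper written for this example, not part of Eddy AutoML, and a real meta-feature extractor would track many more properties.

```python
import math


class RunningStats:
    """Incremental mean and variance of one numeric feature (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, value):
        # Constant-time, constant-memory update for each new observation.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Sample variance; reported as 0.0 until two observations have arrived.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def std(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.count, stats.mean, stats.variance)
```

Each update touches only three numbers, so the statistics stay current without keeping any past samples in memory.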

The result of this project is a free and open-source AutoML platform that can be deployed on any Kubernetes cluster, either on a public cloud (AWS, Google Cloud, Azure, etc.) or on-premises. It aims to be compatible both with existing Kafka clusters deployed as part of other architectures and with new clusters deployed together with the platform. The goal of the application is to provide a UI that helps developers, engineers and researchers easily deploy ML jobs with minimal knowledge of ML.
