OpenML AutoML Benchmark

The OpenML AutoML Benchmark provides a framework for evaluating and comparing open-source AutoML systems. The system is extensible because you can add your own AutoML frameworks and datasets. For a thorough explanation of the benchmark, and evaluation of results, you can read our paper which was accepted at the 2019 ICML AutoML Workshop.

NOTE: This benchmarking framework currently features binary and multiclass classification; extending to regression is a work in progress. Please file an issue with any concerns/questions.

Installation
- Pre-requisites
- Setup
Quickstart
Running benchmarks
Advanced configuration
Issues
Frequently Asked Questions

Automatic Machine Learning (AutoML) systems automatically build machine learning pipelines or neural architectures in a data-driven, objective, and automatic way. They automate a lot of drudge work in designing machine learning systems, so that better systems can be developed, faster. However, AutoML research is also slowed down by two factors:

We currently lack standardized, easily-accessible benchmarking suites of tasks (datasets) that are curated to reflect important problem domains, practical to use, and sufficiently challenging to support a rigorous analysis of performance results.
Subtle differences in the problem definition, such as the design of the hyperparameter search space or the way time budgets are defined, can drastically alter a task’s difficulty. This issue makes it difficult to reproduce published research and compare results from different papers.

This toolkit aims to address these problems by setting up standardized environments for in-depth experimentation with a wide range of AutoML systems.

Documentation: https://openml.github.io/automlbenchmark/

Features:

Curated suites of benchmarking datasets from OpenML.
Includes code to benchmark a number of popular AutoML systems on regression and classification tasks.
New AutoML systems can be added
Experiments can be run in Docker or Singularity containers
Execute experiments locally or on AWS (see below)

Installation

Pre-requisites

To run the benchmarks, you will need:

Python 3.7+.
PIP3: ensure you have a recent version. If necessary, upgrade your pip using python -m pip install -U pip.
The Python libraries listed in requirements.txt: it is strongly recommended to first create a Python virtual environment (cf. also Pyenv: quick install using curl https://pyenv.run | bash or brew install pyenv) and work in it if you don't want to mess up your global Python environment.
Docker, if you plan to run the benchmarks in a container.

Setup

Clone the repo (in development environment, you should of course remove the --depth 1 argument):

git clone https://github.com/openml/automlbenchmark.git --branch stable --depth 1
cd automlbenchmark

Optional: create a Python3 virtual environment.

NOTE: we don't recommend creating your virtual environment with virtualenv library here as the application may create additional virtual environments for some frameworks to run in isolation. Those virtual environments are created internally using python -m venv and we encountered issues with pip when venv is used on top of a virtualenv environment. Therefore, we rather suggest one of the method below:

using venv on Linux/macOS:

python3 -m venv ./venv
source venv/bin/activate
# remember to call `deactivate` once you're done using the application

using venv on Windows:

python3 -m venv ./venv
venv\Scripts\activate
# remember to call `venv\Scripts\deactivate` once you're done using the application

or using pyenv:

pyenv install {python_version: 3.7.4}
pyenv virtualenv ve-automl
pyenv local ve-automl

Then pip install the dependencies:

python -m pip install -r requirements.txt

NOTE: in case of issues when installing Python requirements, you may want to try the following:
- on some platforms, we need to ensure that requirements are installed sequentially: xargs -L 1 python -m pip install < requirements.txt.
- enforce the python -m pip version above in your virtualenv: python -m pip install --upgrade pip==19.3.1.

Quickstart

To run a benchmark call the runbenchmark.py script with at least the following arguments:

The AutoML framework that should be evaluated, see frameworks.yaml for supported frameworks. If you want to add a framework see HOWTO.
The benchmark suite to run should be one implemented in benchmarks folder, or an OpenML study or task (formatted as openml/s/X or openml/t/Y respectively).
(Optional) The constraints applied to the benchmark as defined by default in constraints.yaml. Default constraint is test (2 folds for 10 min each).
(Optional) If the benchmark should be run local (default, tested on Linux and macOS only), in a docker container or on aws using multiple ec2 instances.

Examples:

python3 runbenchmark.py 
python3 runbenchmark.py constantpredictor
python3 runbenchmark.py tpot test
python3 runbenchmark.py autosklearn openml/t/59 -m docker
python3 runbenchmark.py h2oautoml validation 1h4c -m aws
python3 runbenchmark.py autogluon:latest validation
python3 runbenchmark.py tpot:2020Q2

For the complete list of supported arguments, run:

python3 runbenchmark.py --help

usage: runbenchmark.py [-h] [-m {local,aws,docker,singularity}]
                       [-t [task_id [task_id ...]]]
                       [-f [fold_num [fold_num ...]]] [-i input_dir]
                       [-o output_dir] [-u user_dir] [-p parallel_jobs]
                       [-s {auto,skip,force,only}] [-k [true|false]]
                       framework [benchmark] [constraint]

positional arguments:
  framework             The framework to evaluate as defined by default in
                        resources/frameworks.yaml. To use a labelled framework
                        (i.e. a framework defined in
                        resources/frameworks_{label}.yaml), use the syntax
                        {framework}:{label}.
  benchmark             The benchmark type to run as defined by default in
                        resources/benchmarks/{benchmark}.yaml, a path to a
                        benchmark description file, or an openml suite or
                        task. OpenML references should be formatted as
                        'openml/s/X' and 'openml/t/Y', for studies and tasks
                        respectively. Use 'test.openml/s/X' for the OpenML test
                        server. Defaults to `test`.
  constraint            The constraint definition to use as defined by default
                        in resources/constraints.yaml. Defaults to `test`.

optional arguments:
  -h, --help            show this help message and exit
  -m {local,docker,aws,singularity}, --mode {local,docker,aws,singularity}
                        The mode that specifies how/where the benchmark tasks
                        will be running. Defaults to local.
  -t [task_id [task_id ...]], --task [task_id [task_id ...]]
                        The specific task name (as defined in the benchmark
                        file) to run. When an OpenML reference is used as
                        benchmark, the dataset name should be used instead. If
                        not provided, then all tasks from the benchmark will
                        be run.
  -f [fold_num [fold_num ...]], --fold [fold_num [fold_num ...]]
                        If task is provided, the specific fold(s) to run. If
                        fold is not provided, then all folds from the task
                        definition will be run.
  -i input_dir, --indir input_dir
                        Folder where datasets are loaded by default. Defaults
                        to `input_dir` as defined in resources/config.yaml
  -o output_dir, --outdir output_dir
                        Folder where all the outputs should be written.
                        Defaults to `output_dir` as defined in
                        resources/config.yaml
  -u user_dir, --userdir user_dir
                        Folder where all the customizations are stored.
                        Defaults to `user_dir` as defined in
                        resources/config.yaml
  -p parallel_jobs, --parallel parallel_jobs
                        The number of jobs (i.e. tasks or folds) that can run
                        in parallel. Defaults to 1. Currently supported only
                        in docker and aws mode.
  -s {auto,skip,force,only}, --setup {auto,skip,force,only}
                        Framework/platform setup mode. Defaults to auto.
                        •auto: setup is executed only if strictly necessary.
                        •skip: setup is skipped. •force: setup is always
                        executed before the benchmark. •only: only setup is
                        executed (no benchmark).
  -k [true|false], --keep-scores [true|false]
                        Set to true [default] to save/add scores in output
                        directory.

The script will produce output that records task metadata and the result. The result is the score on the test set, where the score is a specific model performance metric (e.g. "AUC") defined by the benchmark.

   task  framework  fold    result   mode   version                  utc       acc       auc   logloss
0  iris  H2OAutoML     0  1.000000  local  3.22.0.5  2019-01-21T15:19:07  1.000000       NaN  0.023511
1  iris  H2OAutoML     1  1.000000  local  3.22.0.5  2019-01-21T15:20:12  1.000000       NaN  0.091685
2   kc2  H2OAutoML     0  0.811321  local  3.22.0.5  2019-01-21T15:21:11  0.811321  0.859307       NaN
3   kc2  H2OAutoML     1  0.886792  local  3.22.0.5  2019-01-21T15:22:12  0.886792  0.888528       NaN

Running benchmarks

The automlbenchmark app currently allows running benchmarks in various environments:

in a docker container (running locally or on multiple AWS instances).
completely locally, if the framework is supported on the local system.
on AWS, possibly distributing the tasks to multiple EC2 instances, each of them running the benchmark either locally or in a docker container.

In Docker image

The Docker image is automatically built before running the benchmark if it doesn't already exist locally or in a public repository (by default in https://hub.docker.com/orgs/automlbenchmark/repositories). Especially, without docker image, the application will need to download and install all the dependencies when building the image, so this may take some time.

The generated image is usually named automlbenchmark/{framework}:{tag}, but this is customizable per framework: cf. resources/frameworks.yaml and HOWTO for details.

For example, this will build a Docker image for the RandomForest framework and then immediately start a container to run the validation benchmark, using all folds, allocating 1h and 4 cores for each task:

python3 runbenchmark.py RandomForest validation 1h4c -m docker

If the corresponding image already exists locally and you want it to be rebuilt before running the benchmark, then the setup needs to be forced:

python3 runbenchmark.py {framework} {benchmark} {constraint} -m docker -s force

The image can also be built without running any benchmark:

python3 runbenchmark.py {framework} -m docker -s only

In rare cases, mainly for development, you may want to specify the docker image:

python3 runbenchmark.py {framework} {benchmark} {constraint} -m docker -Xdocker.image={image}

In local environment

If docker allows portability, it is still possible to run the benchmarks locally without container on some environments (currently Linux, and macOS for most frameworks).

A minimal example would be to run the test benchmarks with a random forest:

python3 runbenchmark.py RandomForest test

The majority of frameworks though require a setup step before being able to run a benchmark. Please note that this step may take some time depending on the framework. This setup is executed by default on first run of the framework, but in this case, it is not guaranteed that the benchmark run following immediately will manage to complete successfully (for most frameworks though, it does).

In case of error, just run the benchmark one more time.

If it still fails, you may need to rerun the setup step manually:

python3 runbenchmark.py {framework} -s only

You can then run the benchmarks as many times as you wish.

When testing a framework or a new dataset, you may want to run only a single task and a specific fold, for example:

python3 runbenchmark.py TPOT validation -t bioresponse -f 0

On AWS

To run a benchmark on AWS you additionally need to have a configured AWS account. The application is using the boto3 Python package to exchange files through S3 and create EC2 instances.

If this is your first time setting up your AWS account on the machine that will run the automlbenchmark app, you can use the AWS CLI tool and run:

aws configure

You will need your AWS Access Key ID, AWS Secret Access Key, and pick a default EC2 region.

NOTE: Currently the AMI is only configured for the following regions so you'll have to set your default region as one of these:
- us-east-1
- us-west-1
- eu-west-1
- eu-central-1

On first use, it is recommended to simply copy the config.yaml from examples/aws to your user ~/.config/automlbenchmark folder (or merge it if you already have a config.yaml in this user folder) and follow the instructions in that file.

To run a test to see if the benchmark framework is working on AWS, do the following:

python3 runbenchmark.py constantpredictor test -m aws

This will create and start an EC2 instance for each benchmark job and run the 4 jobs (2 OpenML tasks * 2 folds) from the test benchmark sequentially, each job running for 1mn in this case (excluding setup time for the EC2 instances).

For longer benchmarks, you'll probably want to run multiple jobs in parallel and distribute the work to several EC2 instances, for example:

python3 runbenchmark.py AUTOWEKA validation 1h4c -m aws -p 4

will keep 4 EC2 instances running, monitor them in a dedicated thread, and finally collect all outputs from s3.

NOTE: each EC2 instance is provided with a time limit at startup to ensure that in any case, the instance is stopped even if there is an issue when running the benchmark task. In this case the instance is stopped, not terminated, and we can therefore inspect the machine manually (ideally after resetting its UserData field to avoid re-triggering the benchmark on the next startup).

The console output is still showing the instances starting, outputs the progress and then the results for each dataset/fold combination:

Running `H2OAutoML_nightly` on `validation` benchmarks in `aws` mode
Loading frameworks definitions from ['/Users/me/repos/automlbenchmark/resources/frameworks.yaml'].
Loading benchmark definitions from /Users/me/repos/automlbenchmark/resources/benchmarks/validationt.yaml.
Uploading `/Users/me/repos/automlbenchmark/resources/benchmarks/validation.yaml` to `ec2/input/validation.yaml` on s3 bucket automl-benchmark.
...
Starting new EC2 instance with params: H2OAutoML_nightly /s3bucket/input/validation.yaml -t micro-mass -f 0
Started EC2 instance i-0cd081efc97c3bf6f 
[2019-01-22T11:51:32] checking job aws_validation_micro-mass_0_H2OAutoML_nightly on instance i-0cd081efc97c3bf6f: pending 
Starting new EC2 instance with params: H2OAutoML_nightly /s3bucket/input/validation.yaml -t micro-mass -f 1
Started EC2 instance i-0251c1655e286897c 
...
[2019-01-22T12:00:32] checking job aws_validation_micro-mass_1_H2OAutoML_nightly on instance i-0251c1655e286897c: running
[2019-01-22T12:00:33] checking job aws_validation_micro-mass_0_H2OAutoML_nightly on instance i-0cd081efc97c3bf6f: running
[2019-01-22T12:00:48] checking job aws_validation_micro-mass_1_H2OAutoML_nightly on instance i-0251c1655e286897c: running
[2019-01-22T12:00:48] checking job aws_validation_micro-mass_0_H2OAutoML_nightly on instance i-0cd081efc97c3bf6f: running
...
[  731.511738] cloud-init[1521]: Predictions saved to /s3bucket/output/predictions/h2oautoml_nightly_micro-mass_0.csv
[  731.512132] cloud-init[1521]: H2O session _sid_96e7 closed.
[  731.512506] cloud-init[1521]: Loading predictions from /s3bucket/output/predictions/h2oautoml_nightly_micro-mass_0.csv
[  731.512890] cloud-init[1521]: Metric scores: {'framework': 'H2OAutoML_nightly', 'version': 'nightly', 'task': 'micro-mass', 'fold': 0, 'mode': 'local', 'utc': '2019-01-22T12:00:02', 'logloss': 0.6498889633819804, 'acc': 0.8793103448275862, 'result': 0.6498889633819804}
[  731.513275] cloud-init[1521]: Job local_micro-mass_0_H2OAutoML_nightly executed in 608.534 seconds
[  731.513662] cloud-init[1521]: All jobs executed in 608.534 seconds
[  731.514089] cloud-init[1521]: Scores saved to /s3bucket/output/scores/H2OAutoML_nightly_task_micro-mass.csv
[  731.514542] cloud-init[1521]: Loaded scores from /s3bucket/output/scores/results.csv
[  731.515006] cloud-init[1521]: Scores saved to /s3bucket/output/scores/results.csv
[  731.515357] cloud-init[1521]: Summing up scores for current run:
[  731.515782] cloud-init[1521]:          task          framework    ...         acc   logloss
[  731.516228] cloud-init[1521]: 0  micro-mass  H2OAutoML_nightly    ...     0.87931  0.649889
[  731.516671] cloud-init[1521]: [1 rows x 9 columns]
...
EC2 instance i-0cd081efc97c3bf6f is stopped
Job aws_validation_micro-mass_0_H2OAutoML_nightly executed in 819.305 seconds
[2019-01-22T12:01:34] checking job aws_validation_micro-mass_1_H2OAutoML_nightly on instance i-0251c1655e286897c: running
[2019-01-22T12:01:49] checking job aws_validation_micro-mass_1_H2OAutoML_nightly on instance i-0251c1655e286897c: running
EC2 instance i-0251c1655e286897c is stopping
Job aws_validation_micro-mass_1_H2OAutoML_nightly executed in 818.463 seconds
...
Terminating EC2 instances i-0251c1655e286897c
Terminated EC2 instances i-0251c1655e286897c with response {'TerminatingInstances': [{'CurrentState': {'Code': 32, 'Name': 'shutting-down'}, 'InstanceId': 'i-0251c1655e286897c', 'PreviousState': {'Code': 64, 'Name': 'stopping'}}], 'ResponseMetadata': {'RequestId': 'd09eeb0c-7a58-4cde-8f8b-2308a371a801', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'date': 'Tue, 22 Jan 2019 12:01:53 GMT', 'server': 'AmazonEC2'}, 'RetryAttempts': 0}}
Instance i-0251c1655e286897c state: shutting-down
All jobs executed in 2376.891 seconds
Deleting uploaded resources `['ec2/input/validation.yaml', 'ec2/input/config.yaml', 'ec2/input/frameworks.yaml']` from s3 bucket automl-benchmark.

Output

By default, a benchmark run creates the following subdirectories and files in the output directory (by default a subdirectory of ./results with unique name identifying the benchmark run):

scores: this subdirectory contains
- results.csv: a global scoreboard, keeping scores from all benchmark runs. For safety reasons, this file is automatically backed up to scores/backup/results.{currentdate}.csv by the application before any modification.
- individual score files keeping scores for each framework+benchmark combination (not backed up).
predictions, this subdirectory contains the last predictions in a standardized format made by each framework-dataset combination. Those last predictions are systematically backed up with current data to predictions/backup subdirectory before a new prediction is written.
logs: this subdirectory contains logs produced by the automlbenchmark app, including when it's been run in Docker container or on AWS.

Advanced configuration

If you need to create your own benchmark, add a framework, create a plugin for a proprietary framework, or simply want to use some advanced options (e.g. run some frameworks with non-default parameters), see the HOWTO.

Issues

If you face any issue, please first have a look at the Troubleshooting guide and check the existing issues. Any new issue should also be reported there.

Frequently Asked Questions

When will results be updated, also for the new/updated frameworks?

We don't perform a benchmark evaluation for each new package or update. Due to budget constraints, we can only do a limited number of evaluations. The next full evaluation will be performed before the end of the year 2020. We hope to find funding to guarantee regular evaluations.

(When) will you add framework X?

We are currently not focused on integrating additional AutoML systems. However, we process any pull requests that add frameworks and will assist with the integration. The best way to make sure framework X gets included is to start with the integration yourself or encourage the package authors to do so (for technical details see HOWTO).

It is also possible to open a Github issue indicating the framework you would like added. Please use a clear title (e.g. "Add framework: X") and provide some relevant information (e.g. a link to the documentation). This helps us keep track of which frameworks people are interested in seeing included.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

OpenML AutoML Benchmark

Features:

Installation

Pre-requisites

Setup

Quickstart

Running benchmarks

In Docker image

In local environment

On AWS

Output

Advanced configuration

Issues

Frequently Asked Questions

Files

README.md

Latest commit

History

README.md

File metadata and controls

OpenML AutoML Benchmark

Features:

Installation

Pre-requisites

Setup

Quickstart

Running benchmarks

In Docker image

In local environment

On AWS

Output

Advanced configuration

Issues

Frequently Asked Questions