
Science Benchmark Policy: MLCommons® Science Benchmark Suite Rules

Points of contact:

1. Disclaimer

The MLPerf™ and MLCommons® name and logo are trademarks. In order to refer to a result using the MLPerf™ and MLCommons® name, the result must conform to the letter and spirit of the rules specified in this document. The MLCommons® organization reserves the right to solely determine if a use of its name or logo is acceptable.

2. Overview

Our goal is better scientific accuracy through benchmarks. Our main metric is the accuracy of the science. However, we will also have secondary metrics that report time, space, and resource usage such as energy and temperature behavior.

The Science WG will use training as the primary benchmark. Here we specify the rules for training submissions.

The Science WG will have an Open division, and submissions to this division must be indicated as "open".

A division submission should focus on improving the accuracy of a benchmark. The stopping criterion is the value of the loss specified in the benchmark. The benchmark can also include system performance as an additional result and provide the logging information outlined in the MLPerf HPC Rules. Power and temperature measurements may also be supplied.

The Open division submission aims to improve scientific discovery from the dataset specified in the benchmark, which will specify one or more scientific measurements to be calculated in the submission. The result is the value of these specified measurements from the submitted model. This model can be based on the supplied reference model or be entirely distinct. Data augmentation is allowed, and if the reference model is used, all of its hyperparameters may be changed. The result should be a GitHub (markdown) document starting with a table listing Measurement name, Reference model value, and Submitted model value. For benchmarks with more than one measurement, an average difference between the submitted and reference measurements should be given. Power and performance values are optional but encouraged in the results document. The results document should give enough detail on the submitted model and any data augmentation so that the review team can evaluate its scientific soundness. Citations should be included to describe the scientific basis of the approach. Other rules for the Open division are as described in the MLCommons® Training Rules; there are no special rules for the Open Science division.
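As an illustration only, not part of the rules, the following sketch shows one way to assemble such a results table and the average difference for a benchmark with more than one measurement. The measurement names and values used here are hypothetical.

[source,python]
----
# Sketch: building the results table and the average difference between
# submitted and reference measurement values.  The measurement names and
# values below are hypothetical and only illustrate the expected layout.

# Hypothetical measurements: name -> (reference model value, submitted model value)
measurements = {
    "NNSE": (0.80, 0.86),
    "MAE":  (0.120, 0.095),
}

rows = ["| Measurement name | Reference model value | Submitted model value |",
        "|---|---|---|"]
for name, (ref, sub) in measurements.items():
    rows.append(f"| {name} | {ref} | {sub} |")

# For benchmarks with more than one measurement, report the average difference.
avg_diff = sum(sub - ref for ref, sub in measurements.values()) / len(measurements)

print("\n".join(rows))
print(f"\nAverage difference (submitted - reference): {avg_diff:+.4f}")
----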

To showcase the various aspects of the benchmarks and contrast them with the existing training benchmarks from MLCommons® at the time of writing of this document, we have included Table 1. The first line, Non Science Training Closed, refers to training conducted by other MLCommons® groups. The rows with Science in the Division column show the target attributes for that division so the focus of the benchmarks can easily be contrasted. Each column showcases an attribute and how it differs between the divisions and other MLCommons® non-science benchmarks.

Table 1: Targeted aspects of the MLCommons® Science benchmark.

|===
| Division | Hardware System | Training data | Model | Test data / method | Primary Metric | Secondary Metrics (optional)

| Science Open
| Anything
| Variable
| Variable
| Fixed
| "Science Quality", Accuracy, Optional user-defined metrics
| Time, memory, power
|===

The secondary metrics include

  • Space

  • Time

  • Energy

  • Different datasets

All rules are taken from the MLPerf Training Rules except for those that are overridden here.

3. Relationship to the MLCommons HPC group

The HPC working group has a focus on infrastructure and the closed division; the Science working group has a focus on science and the open division.

While in HPC the focus of the benchmarks is on infrastructure, the focus here is on scientific accuracy. Nevertheless, the scientific benchmark applications could be used in some cases for HPC evaluation.

4. Benchmarks

The benchmark suite consists of the benchmarks shown in the following table.

|===
| Problem | Dataset | Quality Target

| Earthquake Prediction
| Earthquake data from USGS.
| Normalized Nash–Sutcliffe model efficiency (NNSE), 0.8 < NNSE < 0.99. Details can be found in [3].

| CloudMask
| Multispectral image data from the Sea and Land Surface Temperature Radiometer (SLSTR) instrument.
| Convergence target 0.9.

| STEMDL Classification
| Convergent Beam Electron Diffraction (CBED) patterns.
| The scientific metrics for this problem are the top-1 classification accuracy and the F1-score (the higher the better). The main challenge is to predict 3D geometry from its 3 projections (2D images). Information about the best accuracy so far for this dataset can be found in [4].

| UNO
| Molecular features of tumor cells across multiple data sources.
| Score: 0.0054
|===
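For reference, and assuming the standard definition of the Nash–Sutcliffe efficiency (NSE) from the literature (the benchmark description itself does not spell it out), the normalized form used as the Earthquake quality target can be written as follows, where y_i are the observed values, ŷ_i the model predictions, and ȳ the observed mean:

[latexmath]
++++
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},
\qquad
\mathrm{NNSE} = \frac{1}{2 - \mathrm{NSE}}
++++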

5. Divisions

The Science benchmarks are all considered to be part of the Open division.

5.1. Open Division

Hyperparameters and optimizers may be freely changed. For the Science benchmarks this is the most important division, as the goal is to improve the science and to identify algorithms that optimize the science. For this reason, any algorithm, and any hyperparameter specification for that algorithm, is allowed.

As this may include new algorithms, we would like to collect them as discussed in the Contribution section.

When specifying new algorithms, please provide us with the set of hyperparameters as defined by the examples given in this document.

Algorithms in the Open Division must be properly documented and archived in a GitHub repository with a tagged version so they can easily be reproduced. To be fully included the code must be archived in the official MLCommons® Science GitHub repository.

As the algorithms provided here can also be used in the Open division, we place the same rules on them as on other algorithms.

Most importantly, the scientific accuracy must be measured in the same fashion so that alternative implementations and hyperparameter choices can be compared with each other. Each science application provides a well-defined single comparative measure, or a set of such measures, to evaluate the scientific accuracy. The measure(s) should be widely accepted by the science community.

Algorithms that are not open source do not qualify for the science benchmarks as reproducibility and reviews are limited.

The Open division allows the use of different preprocessing, models, and training methods than the reference implementation.

Our current collection of benchmarks for the Open division includes:

|===
| Problem | Repository

| EarthQuake
| https://github.com/mlcommons/science/

| CloudMask
| https://github.com/mlcommons/science/

| STEMDL
| https://github.com/mlcommons/science/

| CANDLE UNO
| https://github.com/mlcommons/science/
|===

Current hyperparameter and optimizer settings are specified in the section Hyperparameters and Optimizer. For anything not explicitly mentioned there, submissions must match the behavior and settings of the reference implementations.

In order to simplify the complex setup of scientific benchmarks, we recommend that all parameters be included in a config file when available. We recommend YAML format for the config file.
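As a minimal sketch only (assuming the PyYAML package; the file contents are illustrative, with parameter names borrowed from the CloudMask table below, and are not prescribed), such a YAML config could be parsed as follows:

[source,python]
----
# Sketch: loading benchmark parameters from a YAML config.
# The config contents and the "benchmark" field are illustrative only.
import yaml  # PyYAML

EXAMPLE_CONFIG = """
benchmark: cloudmask        # illustrative benchmark name
epochs: 10                  # value > 0
learning_rate: 0.001        # value > 0.0
batch_size: 32              # value > 0
seed: 1234
"""

config = yaml.safe_load(EXAMPLE_CONFIG)

# Basic sanity checks mirroring the constraints in the hyperparameter tables below.
assert config["epochs"] > 0
assert config["learning_rate"] > 0.0
assert config["batch_size"] > 0

print(config)
----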

6. Data Set

6.1. Data State at Start of Run

Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.

The data at the start of the benchmark run should reside on a parallel file system that is persistent (>= 1 month, not subject to eviction by other users), can be downloaded to / accessed by the user, and can be shared among users at the facility. Any staging to node-local disk or memory or system burst buffer should be included in the benchmark time measurement.

You must flush/reset the on-node caches prior to running each instance of the benchmark. Due to practicality issues, you are not required to reset off-node system-level caches.

We otherwise follow the training rule Data State at Start of Run on consistency with the reference implementation preprocessing and allowance for reformatting.

It should be pointed out that the data set itself could be a parameter used to increase accuracy in the Open Division. For example, including additional data may improve the outcome of the benchmark.

6.2. Default Data Sets

For the Open division, we have a number of defined data sets for each benchmark that can be used to obtain scientific results. This makes the review easier.

6.3. Open Data Sets

For the Open division, we also allow open data sets to be part of the submission if the submitter considers that data augmentation achieves better science. The submitter must supply the dataset for our review together with instructions for replication. We will be introducing unique identifiers for the model and data to allow convenient identification of the input data and models.

6.4. Identifiers

All benchmark sources are contained in a GitHub repository, and a tagged version is provided for all benchmarked applications. In addition, all data will use a tagging mechanism and will be part of the benchmark submission. If the data fits in GitHub, we will use GitHub. Otherwise, we will place it in a data archive that is openly accessible.

We support the DataPerf MLCommons® working group's studies to integrate such identifiers and, when available, will evaluate their integration.

7. Training Loop

Our focus is training on the data, but it may take considerable effort to prepare the data for the training loop. Such preparation and its performance are integrated into the benchmark.

7.1. Hyperparameters and Optimizer

Each application has its own hyperparameters and optimizer configurations. They can be controlled with the parameters listed for each application.

7.2. Hyperparameters and Optimizer Earthquake Prediction

|===
| Model | Name | Constraint | Definition | Reference Configuration

| Earthquake
| TFTTransformerepochs
| 0 < value
| num_epochs
| config, UVA

| Earthquake
| TFTTransformerbatch_size
| 0 < value. Example: 64
| batch size used to split the training data into batches from which the model error is calculated and the model coefficients are updated
| config, UVA

| Earthquake
| TFTTransformertestvalbatch_size
| max(128, TFTTransformerbatch_size)
| batch size used for the test and validation data; takes the larger of 128 and TFTTransformerbatch_size
| config, UVA

| Earthquake
| TFTd_model
| 0 < value. Example: 160
| size of the hidden layers in the model
|

| Earthquake
| Tseq
| 0 < value. Example: 26
| number of encoder steps; the size of the sequence window, i.e., the number of days included in that section of data
| config, UVA

| Earthquake
| TFTdropout_rate
| 0.0 < value. Example: 0.1
| dropout rate used when training the model to randomly drop nodes from the neural network and thereby prevent overfitting
| config, UVA

| Earthquake
| learning_rate
| 0.0 < value. Example: 0.0000005
| how quickly the model adapts to the problem; larger values mean faster convergence but potentially less optimal solutions, smaller values mean slower convergence and potentially better solutions, but training may fail if the learning rate is too small. In general, a variable learning rate works best: start larger and decrease it as returns diminish or as the solution converges.
| config, UVA

| Earthquake
| early_stopping_patience
| 0 < value. Example: 60
| early stopping patience parameter for Keras; stops training when the monitored metric no longer improves, to prevent overfitting
| config, UVA
|===
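For orientation, the following sketch collects the documented Earthquake parameter names with the example values from the table above into a single configuration and checks the stated constraints. The values are illustrative (the epoch count, in particular, is an assumption, since the table gives no example), not prescribed settings.

[source,python]
----
# Illustrative Earthquake configuration using the documented parameter names
# and the example values from the table above; not a prescribed setting.
earthquake_config = {
    "TFTTransformerepochs": 2,               # 0 < value; illustrative, no example in the table
    "TFTTransformerbatch_size": 64,          # example value from the table
    "TFTTransformertestvalbatch_size": 128,  # max(128, TFTTransformerbatch_size)
    "TFTd_model": 160,                       # example value from the table
    "Tseq": 26,                              # example value from the table
    "TFTdropout_rate": 0.1,                  # example value from the table
    "learning_rate": 5e-7,                   # example value from the table (0.0000005)
    "early_stopping_patience": 60,           # example value from the table
}

# Check the constraints documented in the table.
assert earthquake_config["TFTTransformerepochs"] > 0
assert earthquake_config["TFTTransformerbatch_size"] > 0
assert earthquake_config["TFTTransformertestvalbatch_size"] == max(
    128, earthquake_config["TFTTransformerbatch_size"])
assert 0.0 < earthquake_config["TFTdropout_rate"]
assert earthquake_config["learning_rate"] > 0.0
assert earthquake_config["early_stopping_patience"] > 0
----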

7.3. Hyperparameters and Optimizer CloudMask

|===
| Model | Name | Constraint | Definition | Reference Configuration

| CloudMask
| epochs
| value > 0
| Number of epochs
| config

| CloudMask
| learning_rate
| value > 0.0. Example: 0.001
| Learning rate
| config

| CloudMask
| batch_size
| value > 0. Example: 32
| Batch size
| config

| CloudMask
| MIN_SST
| value > 273.15
| Minimum allowable sea surface temperature (in kelvin)
| config

| CloudMask
| PATCH_SIZE
| value = 256
| Size of image patches
| config

| CloudMask
| seed
| value = 1234
| Random seed
| config
|===

7.4. Hyperparameters and Optimizer STEMDL Classification

|===
| Model | Name | Constraint | Definition | Reference Configuration

| STEMDL
| num_epochs
| value > 0
| Number of epochs
| config

| STEMDL
| learning_rate
| value > 0.0. Example: 0.001
| Learning rate
| config

| STEMDL
| batch_size
| value > 0. Example: 32
| Batch size
| config
|===

7.5. Hyperparameters and Optimizer CANDLE UNO

|===
| Model | Name | Constraint | Definition | Reference Configuration

| CANDLE UNO
| num_epochs
| value > 0
| Number of epochs
|

| CANDLE UNO
| learning_rate
| value > 0.0. Example: 0.001
| Learning rate
|

| CANDLE UNO
| batch_size
| value > 0. Example: 32
| Batch size
|
|===

8. Run Results

The MLCommons® Science Benchmark Suite is focused on the accuracy of the benchmarks. Other benchmark metrics can also be submitted, provided that all metrics are sufficiently described to support reproducibility. Submitted results should be announced in the MLCommons® Science Blog.

9. Benchmark Results

It is sufficient to submit a single benchmark result, but it must be documented how it was achieved. For example, if the result is the best of N runs, this must be clearly documented in the submission.

The results are tarred and submitted through the MLCommons® submission process.

10. Identifying Information

To identify a benchmark, the user must add the following information at the beginning of the submission (we use the Earthquake benchmark as an example):

name: Earthquake
user: Gregor von Laszewski
e-mail: [email protected]
organisation:  University of Virginia
division: BII
status: submission
platform: rivanna shared memory

This can easily be achieved through a configuration file and inclusion into the benchmark with the MLCommons® logging library.
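A minimal sketch of how this could be done, assuming the mlperf_logging package (the MLCommons® logging library), is shown below. The key names are derived from the fields above and are illustrative rather than an official schema.

[source,python]
----
# Sketch: emitting the identifying information as MLCommons log events.
# Assumes the mlperf_logging package; the submission_* key names below
# mirror the example fields above and are illustrative only.
from mlperf_logging import mllog

mllog.config(filename="earthquake_submission.log")
mllogger = mllog.get_mllogger()

submission_info = {
    "name": "Earthquake",
    "user": "Gregor von Laszewski",
    "organisation": "University of Virginia",
    "division": "BII",
    "status": "submission",
    "platform": "rivanna shared memory",
}

for key, value in submission_info.items():
    mllogger.event(key=f"submission_{key}", value=value)
----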

11. Contribution

We expect that over time additional benchmarks will be contributed. At this time we have adopted the following best practice for contribution:

  1. The initial benchmark is hosted on a group-accessible GitHub repository, where members have full access rights. These may be different repositories. Currently, we have one repository at [10].

  2. New versions will first be made available in that group repository using branching.

  3. A new candidate version is created and merged into main.

  4. The candidate version is internally tested by the group members to evaluate expected behavior.

  5. Once passed, the code is uploaded to the MLCommons® Science GitHub Repository [9].

  6. Announcements are made to solicit submissions.

  7. Submissions are checked and integrated according to the MLCommons® rules and policies.

The links to the current development repositories are as follows:

|===
| Problem | MLCommons® Repository | Development Repository

| EarthQuake
| link
| link

| CloudMask
| link
| link

| STEMDL
| link
| link

| CANDLE UNO
| link
| link
|===

12. Logging Libraries

Codes augmented for consideration for inclusion into the science benchmarks must use the MLCommons® logging library.

An alternative library that internally produces MLCommons® events for logging is also available. This library has the advantage of generating a human-readable summary table in addition to the MLCommons® log events.

13. Directory Structure for Submission

The information for submission is available in the document.

References

We include here a list of supporting and related documents.