Version 0.5
April 3, 2023
https://github.com/mlcommons/science/blob/main/policy.adoc

Points of contact:

- Geoffrey Fox ([email protected])
- Tony Hey ([email protected])
- Gregor von Laszewski ([email protected])
- Juri Papay ([email protected])
- Jeyan Thiyagalingam ([email protected])
The MLPerf™ and MLCommons® name and logo are trademarks. In order to refer to a result using the MLPerf™ and MLCommons® name, the result must conform to the letter and spirit of the rules specified in this document. The MLCommons® organization reserves the right to solely determine if a use of its name or logo is acceptable.
Our goal is better scientific accuracy through benchmarks. Our main metric is the accuracy of the science. However, we will have secondary metrics that report time, space, and resources such as energy and temperature behavior.
The Science WG will use training as the primary benchmark. Here we specify the rules for training submissions.
The Science WG will have an Open division, and submissions to this division must be indicated as "open".
The division submission should focus on improving the accuracy of a benchmark. The stopping criterion will be the loss value specified in the benchmark. The benchmark can also include system performance as an additional result and provide the logging information outlined in the MLPerf HPC Rules. Power and temperature measurements may also be supplied.
The Open division submission aims to improve scientific discovery from the dataset specified in the benchmark, which specifies one or more scientific measurements to be calculated in the submission. The result will be the values of these measurements obtained from the submitted model. This model can be based on the supplied reference model or be entirely distinct. Data augmentation is allowed, and all hyperparameters of the reference model can be changed if it is used. The result should be a GitHub (markdown) document starting with a table listing the measurement name, the reference model value, and the submitted model value. For benchmarks with more than one measurement, the average difference between submitted and reference measurements should be given (see the sketch after this paragraph). Power and performance values are optional but encouraged in the results document. The results document should give enough detail on the submitted model and any data augmentation so that the review team can evaluate its scientific soundness. Citations should be included to describe the scientific basis of the approach. All other rules for the Open division are as described in the MLCommons® Training Rules; there are no special rules for the Open Science division.
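The average difference can be computed directly from the measurement table. The following is a minimal sketch; the measurement names and values are hypothetical placeholders, and the actual measurements are those specified by each benchmark.

[source,python]
----
# Minimal sketch for the Open division results document: compute per-measurement
# differences between the submitted and reference models and their average.
# Measurement names and values below are hypothetical placeholders.

reference = {"NNSE": 0.80, "F1": 0.90}   # reference model values (hypothetical)
submitted = {"NNSE": 0.85, "F1": 0.92}   # submitted model values (hypothetical)

differences = {name: submitted[name] - reference[name] for name in reference}
average_difference = sum(differences.values()) / len(differences)

print("Measurement name | Reference model value | Submitted model value")
for name, ref_value in reference.items():
    print(f"{name} | {ref_value} | {submitted[name]}")
print(f"Average difference: {average_difference:.3f}")
----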
To showcase the various aspects of the benchmarks and contrast them with the existing training benchmarks from MLCommons® at the time of writing, we have included Table 1. The first row, Non Science Training Closed, refers to training conducted by other MLCommons® groups. The rows with Science in the Division column show the target attributes for that division so that the focus of the benchmarks can easily be contrasted. Each column showcases an attribute and how it differs between the divisions and the other, non-science MLCommons® benchmarks.
Table 1: Targeted aspects of the MLCommons® Science benchmark.
Division | Hardware System | Training data | Model | Test data / method | Primary Metric | Secondary Metrics (optional)
Science Open | Anything | Variable | Variable | Fixed | "Science Quality", Accuracy, Optional user-defined metrics | Time, memory, power
The secondary metrics include:

- Space
- Time
- Energy
- Different datasets
All rules are taken from the MLPerf Training Rules except for those that are overridden here.
The HPC working group focuses on infrastructure and the closed division, whereas the Science working group focuses on science and the open division. While in HPC the focus of the benchmarks is on infrastructure, the focus here is on scientific accuracy. Nevertheless, the scientific benchmark applications could in some cases be used for HPC evaluation.
The benchmark suite consists of the benchmarks shown in the following table.
Problem | Dataset | Quality Target
Earthquake Prediction | Earthquake data from USGS. | Normalized Nash–Sutcliffe model efficiency (NNSE); see the definition below this table.
CloudMask | Multispectral image data from the Sea and Land Surface Temperature Radiometer (SLSTR) instrument. | Convergence target.
STEMDL Classification | Convergent Beam Electron Diffraction (CBED) patterns. | The scientific metric for this problem is the top-1 classification accuracy and the F1-score (the higher the better). The main challenge is to predict a 3D geometry from its 3 projections (2D images). Information about the best accuracy so far for this dataset can be found in [4].
UNO | Molecular features of tumor cells across multiple data sources. | Score:
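For reference, the Nash–Sutcliffe efficiency (NSE) and its normalized form (NNSE) are commonly defined as follows; the exact observed and modeled quantities being compared are specified by the Earthquake reference implementation:

\[
\mathrm{NSE} = 1 - \frac{\sum_{t}\left(O_t - M_t\right)^{2}}{\sum_{t}\left(O_t - \bar{O}\right)^{2}},
\qquad
\mathrm{NNSE} = \frac{1}{2 - \mathrm{NSE}},
\]

where \(O_t\) are the observed values, \(M_t\) the modeled values, and \(\bar{O}\) the mean of the observations. The normalization maps NSE from \((-\infty, 1]\) onto \((0, 1]\), with 1 indicating a perfect match.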
The Science benchmarks are all considered to be part of an Open division.
Hyperparameters and optimizers may be freely changed. For Science benchmarks this is the most important division as the goal is to improve the science and identify algorithms that optimize the science. For this reason, any algorithm and hyperparameter specification for that algorithm is allowed.
As this may include new algorithms, we would like to collect them as discussed in the Contribution section.
When specifying new algorithms, please provide us with the set of hyperparameters as defined by the examples given in this document.
Algorithms in the Open Division must be properly documented and archived in a GitHub repository with a tagged version so they can easily be reproduced. To be fully included the code must be archived in the official MLCommons® Science GitHub repository.
As the algorithms provided here can also be used in the open division, we place the same rules on them as on other algorithms.
Most importantly, the scientific accuracy must be measured in the same fashion so that alternative implementations and hyperparameter choices can be compared with each other. Each science application provides a well-defined measure, or set of comparative measures, to evaluate the scientific accuracy. The measure(s) should be widely accepted by the science community.
Algorithms that are not open source do not qualify for the science benchmarks, as reproducibility and review would be limited.
The Open division allows the use of preprocessing, models, and training methods that differ from the reference implementation.
Our current collection of benchmarks for the open division includes:

Problem | Repository
EarthQuake |
CloudMask |
STEMDL |
CANDLE UNO |
Current hyperparameter and optimizer settings are specified in the section Hyperparameters and Optimizer. For anything not explicitly mentioned there, submissions must match the behavior and settings of the reference implementations.
In order to simplify the complex setup for scientific benchmarks, we recommend that all parameters are included in the config file when available. We recommend a YAML format for the config file.
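As an illustration, a minimal sketch of such a YAML config file and how a benchmark could read it is given below. The file name and parameter values are hypothetical; the parameter names follow the CloudMask table in the Hyperparameters and Optimizer section.

[source,python]
----
# Minimal sketch, assuming a hypothetical YAML config file "cloudmask_config.yaml"
# with parameter names taken from the CloudMask table below (values are
# illustrative only):
#
#   epochs: 50
#   learning_rate: 0.001
#   batch_size: 32
#   PATCH_SIZE: 256
#   seed: 1234

import yaml  # requires the pyyaml package

with open("cloudmask_config.yaml") as f:
    config = yaml.safe_load(f)

print(f"epochs={config['epochs']}, "
      f"batch_size={config['batch_size']}, "
      f"learning_rate={config['learning_rate']}")
----

Keeping all parameters in one such file makes it straightforward to record them in the submission and in the log output.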
Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.
The data at the start of the benchmark run should reside on a parallel file system that is persistent (>= 1 month, not subject to eviction by other users), can be downloaded to / accessed by the user, and can be shared among users at the facility. Any staging to node-local disk or memory or system burst buffer should be included in the benchmark time measurement.
You must flush/reset the on-node caches prior to running each instance of the benchmark. Due to practicality issues, you are not required to reset off-node system-level caches.
We otherwise follow the training rule Data State at Start of Run on consistency with the reference implementation preprocessing and allowance for reformatting.
It should be pointed out that the dataset itself could be a parameter for increasing accuracy in the Open Division. For example, including additional data may improve the outcome of the benchmark.
For the open division, we have a number of defined datasets for each benchmark that can be used for obtaining scientific results. This makes the review easier.
For the open division, we also allow open datasets to be part of the submission if the submitter considers that data augmentation achieves better science. The submitter must supply the dataset and instructions for replication so that we can review them. We will be introducing unique identifiers for the models and data to allow convenient identification of the input data and models.
All benchmark sources are contained in a GitHub repository, and a tagged version is provided for all benchmarked applications. In addition, all data will use a tagging mechanism and will be part of the benchmark submission. If the data fits in GitHub, we will use GitHub; otherwise, we will place it in an openly accessible data archive.
We support the DataPerf MLCommons® working group studies to integrate such identifiers and, when available, will evaluate their integration.
Our focus is on training, but it may take considerable effort to prepare the data for the training loop. Such preparation and its performance are integrated into the benchmark.
Each application has its own hyperparameters and optimizer configurations. They can be controlled with the parameters listed for each application.
Model | Name | Constraint | Definition | Reference Configuration
Earthquake | TFTTransformerepochs | | num_epochs |
Earthquake | TFTTransformerbatch_size | | batch size used to split the training data into batches for calculating model error and updating model coefficients |
Earthquake | TFTTransformertestvalbatch_size | | range between the minimum and maximum batch size |
Earthquake | TFTd_model | | number of hidden layers in the model |
Earthquake | Tseq | | number of encoder steps; the size of the sequence window, i.e., the number of days included in that section of data |
Earthquake | TFTdropout_rate | | dropout rate: the rate at which nodes are randomly dropped from the neural network during training to prevent overfitting |
Earthquake | learning_rate | | how quickly the model adapts to the problem; a larger rate means faster convergence but less optimal solutions, a smaller rate means slower convergence but potentially more optimal solutions, and training may fail if the rate is too small. In general, a variable learning rate is best: start larger and decrease it as returns diminish or as the solution converges |
Earthquake | early_stopping_patience | | early stopping parameter for Keras; a way to prevent overfitting or degradation of metrics |
Model | Name | Constraint | Definition | Reference Configuration
CloudMask | epochs | | Number of epochs |
CloudMask | learning_rate | | Learning rate |
CloudMask | batch_size | | Batch size |
CloudMask | MIN_SST | | Minimum allowable Sea Surface Temperature |
CloudMask | PATCH_SIZE | | Size of image patches |
CloudMask | seed | | Random seed |
Model | Name | Constraint | Definition | Reference Configuration
STEMDL | num_epochs | | Number of epochs |
STEMDL | learning_rate | | Learning rate |
STEMDL | batch_size | | Batch size |
The MLCommons® Science Benchmark Suite is focused on the accuracy of the benchmarks. Other benchmark metrics can also be submitted, provided that all metrics are sufficiently described to support reproducibility. Submitted results should be announced in the MLCommons® Science Blog.
It is sufficient to submit a single benchmark result, but it must be documented how the result was achieved. For example, if the result is the best of N runs, this must be clearly documented in the submission.
The results are tarred and submitted through the MLCommons® submission process.
To identify a benchmark, the user must add the following information at the beginning of the submission (we use here an example for the Earthquake benchmark):
name: Earthquake
user: Gregor von Laszewski
e-mail: [email protected]
organisation: University of Virginia
division: BII
status: submission
platform: rivanna shared memory
This can easily be achieved through a configuration file and its inclusion into the benchmark with the MLCommons® logging library.
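A minimal sketch of recording this information with the MLCommons® logging library (the mlperf_logging package) is shown below; the log file name is hypothetical, and each benchmark's reference implementation defines the exact set of keys it logs.

[source,python]
----
# Minimal sketch, assuming the mlperf_logging package; the log file name is
# hypothetical and the values are taken from the Earthquake example above.
from mlperf_logging import mllog

mllog.config(filename="earthquake_submission.log")
mllogger = mllog.get_mllogger()

mllogger.event(key=mllog.constants.SUBMISSION_BENCHMARK, value="Earthquake")
mllogger.event(key=mllog.constants.SUBMISSION_ORG, value="University of Virginia")
mllogger.event(key=mllog.constants.SUBMISSION_DIVISION, value="open")
mllogger.event(key=mllog.constants.SUBMISSION_PLATFORM, value="rivanna shared memory")
----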
We expect that over time additional benchmarks will be contributed. At this time we have adopted the following best practice for contribution:
- The initial benchmark is hosted on a group-accessible GitHub repository, where members have full access rights. These may be different repositories. Currently, we have one repository at [10].
- New versions will first be made available in that group repository using branching.
- A new candidate version is created and merged into main.
- The candidate version is internally tested by the group members to evaluate the expected behavior.
- Once it passes, the code is uploaded to the MLCommons® Science GitHub Repository [9].
- Announcements are made to solicit submissions.
- Submissions are checked and integrated according to the MLCommons® rules and policies.
The links to the current development repositories are as follows:
Problem | MLCommons® Repository | Development Repository
EarthQuake | |
CloudMask | |
STEMDL | |
CANDLE UNO | |
Code augmentations submitted for consideration for inclusion in the science benchmarks must use the MLCommons® logging library.
An alternative library that internally produces MLCommons® logging events may also be used. It has the advantage of generating a human-readable summary table in addition to the MLCommons® log events.
The information for submission is available in the MLCommons® submission documentation.
We include here a list of supporting and related documents:

- [1] Overview presentation of the MLScience Group, Gregg Barrett, Wahid Bhimji, Bala Desinghu, Murali Emani, Geoffrey Fox, Grigori Fursin, Tony Hey, David Kanter, Christine Kirkpatrick, Hai Ah Nam, Juri Papay, Amit Ruhela, Mallikarjun Shankar, Jeyan Thiyagalingam, Aristeidis Tsaris, Gregor von Laszewski, Feiyi Wang, Junqi Yin, MLCommons® Community Meeting (also available in Google Docs), December 9, 2021.
- [2] AI Benchmarking for Science: Efforts from the MLCommons® Science Working Group, Jeyan Thiyagalingam, Gregor von Laszewski, Junqi Yin, Murali Emani, Juri Papay, Gregg Barrett, Piotr Luszczek, Aristeidis Tsaris, Christine Kirkpatrick, Feiyi Wang, Tom Gibbs, Venkatram Vishwanath, Mallikarjun Shankar, Geoffrey Fox, Tony Hey, June 2022.
- [3] Earthquake Nowcasting with Deep Learning, Fox, G., Rundle, J., Donnellan, A., Feng, B., GeoHazards 3(2), 199, April 2022.
- [4] Probability Flow for Classifying Crystallographic Space Groups, Pan, J., in: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds), Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, SMC 2020, Communications in Computer and Information Science, vol 1315, Springer, Cham, 2022.
- [10] Science Development GitHub Repository, used to prepare release candidates for the MLCommons® repository.