Add initial 5.0.0 files and configurations #398

Open · wants to merge 4 commits into master
12 changes: 11 additions & 1 deletion mlperf_logging/benchmark_meta.py
@@ -10,12 +10,14 @@
'minigo': 10,
'resnet': 5,
'ssd': 5,
'retinanet': 5,
'stable_diffusion': 10,
'transformer': 10,
'ncf': 10,
'rnnt': 10,
'unet3d': 40,
    'gnn' : 10,
'rgat': 10,
'llama2_70b_lora': 10,
},

@@ -131,6 +133,14 @@
'stable_diffusion',
'llama2_70b_lora',
'gnn'
],
'5.0': [
'bert',
'dlrm_dcnv2',
'retinanet',
'stable_diffusion',
'llama2_70b_lora',
'rgat'
]
},

36 changes: 17 additions & 19 deletions mlperf_logging/compliance_checker/README.md
@@ -10,7 +10,7 @@ To check a log file for compliance:

python -m mlperf_logging.compliance_checker [--config YAML] [--usage training/hpc] [--ruleset MLPERF_EDITION] FILENAME

By default, 5.0.0 training edition rules are used and the default config is set to `5.0.0/common.yaml`.
This config checks compliance of all common keys and enqueues the benchmark-specific config to be checked as well.
Old training editions still supported are 4.1.0, 4.0.0, 3.1.0, 3.0.0, 2.1.0, 2.0.0, 1.1.0, 1.0.0, 0.7.0 and 0.6.0.
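
For example, to check a single result log against the new default ruleset (the log file name here is illustrative):

    python -m mlperf_logging.compliance_checker --usage training --ruleset 5.0.0 result_0.txt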

@@ -22,23 +22,21 @@ As log examples use [NVIDIA's training logs](https://github.com/mlperf/training_

### Existing config files for training submissions

5.0.0/common.yaml - currently the default config file, checks compliance of the common fields and enqueues the benchmark-specific config file
5.0.0/closed_common.yaml - the common rules file for closed submissions. These rules apply to all benchmarks
5.0.0/open_common.yaml - the common rules file for open submissions. These rules apply to all benchmarks
5.0.0/closed_retinanet.yaml - Per-benchmark rules, closed submissions.
5.0.0/closed_bert.yaml
5.0.0/closed_dlrm_dcnv2.yaml
5.0.0/closed_rgat.yaml
5.0.0/closed_llama2_70b_lora.yaml
5.0.0/closed_stable_diffusion.yaml
5.0.0/open_retinanet.yaml - Per-benchmark rules, open submissions.
5.0.0/open_bert.yaml
5.0.0/open_dlrm_dcnv2.yaml
5.0.0/open_rgat.yaml
5.0.0/open_llama2_70b_lora.yaml
5.0.0/open_stable_diffusion.yaml

### Existing config files for HPC submissions

@@ -173,7 +171,7 @@ Tested and confirmed working using the following software versions:
- Python 2.7.12 + PyYAML 3.11
- Python 3.6.8 + PyYAML 5.1
- Python 3.9.2 + PyYAML 5.3.1
- Python 3.9.10 + PyYAML 5.4.1

### How to install PyYAML

2 changes: 1 addition & 1 deletion mlperf_logging/compliance_checker/mlp_compliance.py
@@ -315,7 +315,7 @@ def get_parser():
parser.add_argument('--usage', type=str, default='training',
choices=usage_choices(),
help='what WG do the benchmarks come from')
    parser.add_argument('--ruleset', type=str, default='5.0.0',
choices=rule_choices(),
help='what version of rules to check the log against')
parser.add_argument('--config', type=str,
3 changes: 3 additions & 0 deletions mlperf_logging/compliance_checker/mlp_parser/__init__.py
@@ -8,6 +8,7 @@
from .ruleset_310 import parse_file as parse_file_310
from .ruleset_400 import parse_file as parse_file_400
from .ruleset_410 import parse_file as parse_file_410
from .ruleset_500 import parse_file as parse_file_500


def parse_file(filename, ruleset='0.6.0'):
@@ -31,5 +32,7 @@ def parse_file(filename, ruleset='0.6.0'):
return parse_file_400(filename)
elif ruleset == '4.1.0':
return parse_file_410(filename)
elif ruleset == '5.0.0':
return parse_file_500(filename)
else:
raise Exception(f'Ruleset "{ruleset}" is not supported')
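
For illustration, a minimal sketch of exercising the new dispatch from Python (the log file name is an assumption):

    from mlperf_logging.compliance_checker.mlp_parser import parse_file

    # '5.0.0' now dispatches to ruleset_500.parse_file and returns (loglines, errors)
    loglines, errors = parse_file('result_0.txt', ruleset='5.0.0')
    print('{} lines parsed, {} failed'.format(len(loglines), len(errors)))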
105 changes: 105 additions & 0 deletions mlperf_logging/compliance_checker/mlp_parser/ruleset_500.py
@@ -0,0 +1,105 @@
'''
Parses a text MLPerf log into a structured format.
'''

from __future__ import print_function

import collections
import json
import re
import sys
from dataclasses import dataclass

from io import open

@dataclass
class LogLine:
"""Class for keeping track of an item in inventory."""
full_string: str
timestamp: float
key: str
value: str
lineno: int

TOKEN = ':::MLLOG '


def parse_line(line):
if not line.startswith(TOKEN):
return None

return json.loads(line[len(TOKEN):])


def string_to_logline(lineno, string):
''' Returns a LogLine or raises a ValueError '''
m = parse_line(string)

if m is None:
        raise ValueError('line does not start with the MLLOG token')

args = []
args.append(string) # full string

ts = float(m['time_ms']) # may raise error, e.g. "1.2.3"
# TODO check for weird values
args.append(ts)

args.append(m['key']) # key

j = { 'value': m['value'], 'metadata': m['metadata'] }
args.append(j)

args.append(lineno)
return LogLine(*args)


def parse_file(filename):
''' Reads a file by name and returns list of loglines and list of errors'''
with open(filename, encoding='latin-1') as f:
return parse_generator(f)


def strip_and_dedup(gen):
lines = []
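    # Keep only lines containing the MLLOG token; any prefix before the token
    # (e.g. a launcher or timestamp banner) is stripped by the re.sub below.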
for l in gen:
if TOKEN not in l:
continue
lines.append(re.sub(".*"+TOKEN, TOKEN, l))
return lines



def parse_generator(gen):
''' Reads a generator of lines and returns (loglines, errors)
The list of errors are any parsing issues as a tuple (str_line, error_msg)
'''
loglines = []
failed = []
for lineno, line in enumerate(strip_and_dedup(gen)):
line = line.strip()
try:
ll = string_to_logline(lineno, line)
loglines.append(ll)
except ValueError as e:
failed.append((line, str(e)))
return loglines, failed


if __name__ == '__main__':
if len(sys.argv) != 2:
print('usage: mlp_parser.py FILENAME')
print(' tests parsing on the file.')
sys.exit(1)

filename = sys.argv[1]
lines, errors = parse_file(filename)

print('Parsed {} log lines with {} errors.'.format(len(lines), len(errors)))

if len(errors) > 0:
print('Lines which failed to parse:')
for line, error in errors:
print(' Following line failed: {}'.format(error))
print(line)

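As a quick sanity check of the expected format, a conforming record parses into a LogLine as follows (the sample record is illustrative, not from a real submission):

    from mlperf_logging.compliance_checker.mlp_parser.ruleset_500 import string_to_logline

    sample = ':::MLLOG {"time_ms": 1700000000000, "key": "global_batch_size", "value": 512, "metadata": {}}'
    ll = string_to_logline(0, sample)
    assert ll.key == 'global_batch_size'
    assert ll.value == {'value': 512, 'metadata': {}}
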
48 changes: 48 additions & 0 deletions mlperf_logging/compliance_checker/training_5.0.0/closed_bert.yaml
@@ -0,0 +1,48 @@
- KEY:
NAME: global_batch_size
REQ: EXACTLY_ONE
POST: >
s['global_batch_size'] = v['value']

- KEY:
NAME: opt_base_learning_rate
REQ: EXACTLY_ONE

- KEY:
NAME: opt_lamb_epsilon
REQ: EXACTLY_ONE

- KEY:
NAME: opt_learning_rate_training_steps
REQ: EXACTLY_ONE

- KEY:
NAME: opt_learning_rate_warmup_steps
REQ: EXACTLY_ONE

- KEY:
NAME: num_warmup_steps
REQ: EXACTLY_ONE

- KEY:
NAME: start_warmup_step
REQ: EXACTLY_ONE

- KEY:
NAME: opt_lamb_beta_1
REQ: EXACTLY_ONE

- KEY:
NAME: opt_lamb_beta_2
REQ: EXACTLY_ONE

- KEY:
NAME: opt_lamb_weight_decay_rate
REQ: EXACTLY_ONE

- KEY:
NAME: eval_accuracy
REQ: AT_LEAST_ONE
CHECK:
- "'epoch_num' in v['metadata']"
ATLEAST_ONE_CHECK: "(v['value'] >= 0.720) and v['value'] < 1.0"
11 changes: 11 additions & 0 deletions mlperf_logging/compliance_checker/training_5.0.0/closed_common.yaml
@@ -0,0 +1,11 @@

- KEY:
NAME: submission_benchmark
REQ: EXACTLY_ONE
CHECK: " v['value'] in ['retinanet', 'stable_diffusion', 'dlrm_dcnv2', 'bert', 'rgat', 'llama2_70b_lora'] "
POST: " enqueue_config('training_4.1.0/closed_{}.yaml'.format(v['value'])) "

- KEY:
NAME: gradient_accumulation_steps
REQ: EXACTLY_ONE
CHECK: " v['value'] > 0 "
59 changes: 59 additions & 0 deletions mlperf_logging/compliance_checker/training_5.0.0/closed_dlrm_dcnv2.yaml
@@ -0,0 +1,59 @@
- KEY:
NAME: global_batch_size
REQ: EXACTLY_ONE

- KEY:
NAME: opt_name
REQ: EXACTLY_ONE
CHECK: " v['value'] == 'adagrad' "

- KEY:
NAME: opt_base_learning_rate
REQ: EXACTLY_ONE

- KEY:
NAME: opt_adagrad_learning_rate_decay
REQ: EXACTLY_ONE
CHECK: " v['value'] == 0 "

- KEY:
NAME: opt_weight_decay
REQ: EXACTLY_ONE
CHECK: " v['value'] == 0 "

- KEY:
NAME: opt_adagrad_initial_accumulator_value
REQ: EXACTLY_ONE
CHECK: " v['value'] == 0 "

- KEY:
NAME: opt_adagrad_epsilon
REQ: EXACTLY_ONE
CHECK: " v['value'] == 1e-8 "

- KEY:
NAME: opt_learning_rate_warmup_steps
REQ: EXACTLY_ONE
CHECK: " v['value'] == 0 "

- KEY:
NAME: opt_learning_rate_decay_start_step
REQ: EXACTLY_ONE
CHECK: " v['value'] == 0 "

- KEY:
NAME: opt_learning_rate_decay_steps
REQ: EXACTLY_ONE
CHECK: " v['value'] == 0 "

- KEY:
NAME: eval_accuracy
REQ: AT_LEAST_ONE
CHECK:
- "'epoch_num' in v['metadata']"
ATLEAST_ONE_CHECK: "v['value'] >= 0.80275 and v['value'] <= 1.0"

- KEY:
NAME: eval_samples
REQ: EXACTLY_ONE
CHECK: " v['value'] == 89137319 "
42 changes: 42 additions & 0 deletions mlperf_logging/compliance_checker/training_5.0.0/closed_llama2_70b_lora.yaml
@@ -0,0 +1,42 @@
- KEY:
NAME: global_batch_size
REQ: EXACTLY_ONE
POST: >
s['global_batch_size'] = v['value']

- KEY:
NAME: opt_base_learning_rate
REQ: EXACTLY_ONE


- KEY:
NAME: opt_learning_rate_training_steps
REQ: EXACTLY_ONE

- KEY:
NAME: opt_gradient_clip_norm
REQ: EXACTLY_ONE

- KEY:
NAME: opt_adamw_weight_decay
REQ: EXACTLY_ONE

- KEY:
NAME: gradient_accumulation_steps
REQ: EXACTLY_ONE

- KEY:
NAME: lora_alpha
REQ: EXACTLY_ONE

- KEY:
NAME: lora_rank
REQ: EXACTLY_ONE
CHECK: " v['value'] == 16"

- KEY:
NAME: eval_accuracy
REQ: AT_LEAST_ONE
CHECK:
- "'samples_count' in v['metadata']"
ATLEAST_ONE_CHECK: "(v['value'] <= 0.925) and v['value'] > 0.0"