Merge pull request #761 from mlcommons/dev

Dev -> main
mlcommons · Apr 24, 2024 · ddf5efc
2 parents 698e945 + ccd9fbb
commit ddf5efc

Showing 7 changed files with 42 additions and 25 deletions.
diff --git a/CALL_FOR_SUBMISSIONS.md b/CALL_FOR_SUBMISSIONS.md
@@ -17,7 +17,6 @@ Submissions can compete under two hyperparameter tuning rulesets (with separate
 - **Registration deadline to express non-binding intent to submit: February 28th, 2024**.\
 Please fill out the (mandatory but non-binding) [**registration form**](https://forms.gle/K7ty8MaYdi2AxJ4N8).
 - **Submission deadline: April 04th, 2024** *(moved by a week from the initial March 28th, 2024)*
-- **Deadline for self-reporting preliminary results: May 28th, 2024**
 - [tentative] Announcement of all results: July 15th, 2024
 
 For a detailed and up-to-date timeline see the [Competition Rules](/COMPETITION_RULES.md).

diff --git a/COMPETITION_RULES.md b/COMPETITION_RULES.md
@@ -43,7 +43,6 @@ The Competition begins at 12:01am (ET) on November 28, 2023 and ends at 11:59pm
 
 - **Intention to Submit.** You must register your Intention to Submit no later than 11:59pm ET on February 28, 2024.
 - **Submission Period.** You must complete your Submission and enter it after the Intention to Submit deadline, but no later than 11:59pm ET on April 04, 2024.
-- **Deadline for self-reporting results.** 11:59pm ET on May 28, 2024.
 
 ## Agreement to Official Rules
 
@@ -65,8 +64,6 @@ There are four (4) steps to a successful submission ("Submission").
 
    The form is sent to the working group chairs, who will process your Submission. Failure to complete the proper Submission Forms will results in disqualification of your Submission. At the close of the Submission Period, your GitHub repository must be public.
 
-4. **Report Results.** Prior to the Deadline for self-reporting results, run your Submission on either the qualification set or the full benchmark set and report the results. You must report your scores by uploading all unmodified logs that the benchmarking codebase automatically generates in a separate `/results` directory within the `/submission` folder of your Submission's GitHub repository.
-
 ## Submission Conditions
 
 All Submissions must meet the requirements of the terms contained in these rules, including reliance on new algorithmic or mathematical ideas and concepts, and must not use software engineering approaches in order to increase primitive operations in PyTorch, JAX, their dependencies, the operating systems, or the hardware. By entering, all Team members warrant that their Submission does not infringe any third party's rights, and that Team members have obtained all necessary permissions from all relevant third parties to submit the Submission. If, in the sole discretion of Sponsor, any Submission constitutes copyright or other intellectual property infringement, the Submission will be disqualified. Team must hold all rights through license or ownership to the entire Submission. Team members agree to indemnify Sponsor against any and all claims of infringement from any third party for any use by Sponsor of a Submission. Team members may not be: 1) represented under contract that would limit or impair Sponsor's ability to use the Submission; or 2) are under any other contractual relationship, including but not limited to guild and/or union memberships, that may prohibit them from participating fully in this Competition, or from allowing Sponsor to use royalty-free, the Submission worldwide in all media in perpetuity.

diff --git a/DOCUMENTATION.md b/DOCUMENTATION.md
@@ -400,6 +400,8 @@ Submissions will be scored based on their performance on the [fixed workload](#f
 
 Furthermore, a less computationally expensive subset of the fixed workloads is collected with the [qualification set](#qualification-set). Submitters without enough compute resources to self-report on the full set of fixed and held-out workloads can instead self-report on this smaller qualification set. Well-performing submissions can thereby qualify for computational resources provided by sponsors of the benchmark to be scored on the full benchmark set.
 
+NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.
+
 #### Fixed workloads
 
 The fixed workloads are fully specified with the call for submissions. They contain a diverse set of tasks such as image classification, machine translation, speech recognition, or other typical machine learning tasks. For a single task there might be multiple models and therefore multiple fixed workloads. The entire set of fixed workloads should have a combined runtime of roughly 100 hours on the [benchmarking hardware](#benchmarking-hardware).
@@ -429,6 +431,8 @@ Our scoring procedure uses the held-out workloads only to penalize submissions t
 
 #### Qualification set
 
+NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.
+
 The qualification set is designed for submitters that may not have the compute resources to self-report on the full set of [fixed](#fixed-workloads) and [held-out workloads](#randomized-workloads). They may instead self-report numbers on this smaller qualification set. The best-performing submissions may then qualify for compute sponsorship offering a free evaluation on the full benchmark set and therefore the possibility to win [awards and prizes](/COMPETITION_RULES.md#prizes).
 
 The qualification set consists of the same [fixed workloads](#fixed-workloads) as mentioned above, except for both workloads on *ImageNet*, both workloads on *LibriSpeech*, and the *fastMRI* workload. The remaining three workloads (*WMT*, *Criteo 1TB*, and *OGBG*) form the qualification set. There are no [randomized workloads](#randomized-workloads) in the qualification set. The qualification set of workloads aims to have a combined runtime of roughly 24 hours on the [benchmarking hardware](#benchmarking-hardware).
@@ -449,6 +453,8 @@ All scored runs have to be performed on the benchmarking hardware to allow for a
 - 240 GB in RAM
 - 2 TB in storage (for datasets).
 
+NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.
+
 For self-reported results, it is acceptable to perform the tuning trials on hardware different from the benchmarking hardware, as long as the same hardware is used for all tuning trials. Once the best trial, i.e. the one that reached the *validation* target the fastest, was determined, this run has to be repeated on the competition hardware. For example, submitters can tune using their locally available hardware but have to use the benchmarking hardware, e.g. via cloud providers, for the $5$ scored runs. This allows for a fair comparison to the reported results of other submitters while allowing some flexibility in the hardware.
 
 #### Defining target performance
@@ -571,10 +577,14 @@ on the benchmarking hardware. We also recommend to do a dry run using a cloud in
 
 #### Are we allowed to use our own hardware to self-report the results?
 
+NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.
+
 You only have to use the benchmarking hardware for runs that are directly involved in the scoring procedure. This includes all runs for the self-tuning ruleset, but only the runs of the best hyperparameter configuration in each study for the external tuning ruleset. For example, you could use your own (different) hardware to tune your submission and identify the best hyperparameter configuration (in each study) and then only run this configuration (i.e. 5 runs, one for each study) on the benchmarking hardware.
 
 #### What can I do if running the benchmark is too expensive for me?
 
+NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.
+
 Submitters unable to self-fund scoring costs can instead self-report only on the [qualification set of workloads](/COMPETITION_RULES.md#qualification-set) that excludes some of the most expensive workloads. Based on this performance on the qualification set, the working group will provide - as funding allows - compute to evaluate and score the most promising submissions. Additionally, we encourage researchers to reach out to the [working group](mailto:[email protected]) to find potential collaborators with the resources to run larger, more comprehensive experiments for both developing and scoring submissions.
 
 #### Can I submit previously published training algorithms as submissions?

diff --git a/README.md b/README.md
@@ -27,9 +27,9 @@
 ---
 
 > [!IMPORTANT]
-> Upcoming Deadline:
-> Submission deadline: **April 04th, 2024** (*moved by a week*). \
-> For submission instructions please see [Packaging your Submission Code](/GETTING_STARTED.md#package-your-submission-code) section in the Getting Started document.\
+> Submitters are no longer required to self-report results. 
+> We are currently in the process of evaluating and scoring received submissions.
+> We are aiming to release results by July 15th 2024.
 > For other key dates please see [Call for Submissions](CALL_FOR_SUBMISSIONS.md).
 
 ## Table of Contents <!-- omit from toc -->

diff --git a/algorithmic_efficiency/checkpoint_utils.py b/algorithmic_efficiency/checkpoint_utils.py
@@ -119,7 +119,9 @@ def maybe_restore_checkpoint(framework: str,
 
   else:
     checkpoint_state = latest_ckpt
-    if isinstance(model_params, torch.nn.DataParallel):
+    if isinstance(
+        model_params,
+        (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
       model_params = model_params.module
     model_params.load_state_dict(checkpoint_state['model_params'])
     checkpoint_state['model_params'] = model_params
@@ -196,7 +198,9 @@ def save_checkpoint(framework: str,
     opt_state = jax.device_get(jax_utils.unreplicate(opt_state))
     model_state = jax.device_get(jax_utils.unreplicate(model_state))
   else:
-    if isinstance(model_params, torch.nn.DataParallel):
+    if isinstance(
+        model_params,
+        (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
       model_params = model_params.module
     model_params = model_params.state_dict()
     optimizer_state_dict = {}
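
Note on the `checkpoint_utils.py` hunks above: both change the `isinstance` check to accept a tuple of wrapper types, so a `DistributedDataParallel`-wrapped model is unwrapped via `.module` just like a `DataParallel` one. The following is a minimal sketch of that pattern only. The `Fake*` classes and the `unwrap`, `save_state`, and `restore_state` helpers are hypothetical stand-ins so that the example runs without PyTorch; they are not part of the repository.

```python
class FakeDataParallel:
  """Hypothetical stand-in for torch.nn.DataParallel."""

  def __init__(self, module):
    self.module = module


class FakeDistributedDataParallel:
  """Hypothetical stand-in for torch.nn.parallel.DistributedDataParallel."""

  def __init__(self, module):
    self.module = module


class FakeModel:
  """Hypothetical model exposing the two state-dict methods used above."""

  def __init__(self):
    self._state = {'w': 0}

  def state_dict(self):
    return dict(self._state)

  def load_state_dict(self, state):
    self._state = dict(state)


# isinstance accepts a tuple, so one check covers both wrapper types.
WRAPPER_TYPES = (FakeDataParallel, FakeDistributedDataParallel)


def unwrap(model):
  """Returns the wrapped module, or the model itself if it is not wrapped."""
  if isinstance(model, WRAPPER_TYPES):
    return model.module
  return model


def save_state(model):
  """Reads the state dict from the underlying (unwrapped) model."""
  return unwrap(model).state_dict()


def restore_state(model, state):
  """Loads the state into the underlying model and returns that model."""
  inner = unwrap(model)
  inner.load_state_dict(state)
  return inner
```

Saving from the unwrapped module, as the diff does, keeps the state-dict keys free of the wrapper's `module.` prefix, so a checkpoint written under one wrapper can be loaded under another.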

diff --git a/algorithmic_efficiency/logger_utils.py b/algorithmic_efficiency/logger_utils.py
@@ -16,6 +16,7 @@
 import GPUtil
 import pandas as pd
 import psutil
+import torch.distributed as dist
 
 from algorithmic_efficiency import spec
 from algorithmic_efficiency.pytorch_utils import pytorch_setup
@@ -43,9 +44,6 @@ def get_log_dir(
     resume_last_run: bool,
     overwrite: bool,
 ) -> Optional[str]:
-  if RANK != 0:
-    return
-
   # Construct path to experiment workload directory.
   experiment_dir = os.path.expanduser(experiment_dir)
   workload_dir_name = f'{workload}_{framework}'
@@ -61,18 +59,25 @@ def get_log_dir(
       logging.info(
           f'Removing existing experiment directory {experiment_path} because '
           '--overwrite was set.')
-      shutil.rmtree(experiment_path)
+      if RANK == 0:
+        shutil.rmtree(experiment_path)
     elif resume_last_run:
       logging.info(
           f'Resuming from experiment directory {experiment_path} because '
           '--resume_last_run was set.')
     else:
-      resume = input(
-          'Found existing experiment dir with the same name: {}. Do you wish '
-          'to resume training from this dir? [y/N]:'.format(experiment_path))
-      if resume.lower() != 'y':
-        sys.exit()
-
+      if RANK == 0:
+        resume = input(
+            'Found existing experiment dir with the same name: {}. Do you wish '
+            'to resume training from this dir? [y/N]:'.format(experiment_path))
+        if resume.lower() != 'y':
+          sys.exit()
+
+  if USE_PYTORCH_DDP:
+    try:
+      dist.barrier()
+    except RuntimeError:
+      sys.exit()
   logging.info(f'Creating experiment directory at {experiment_path}.')
   makedir(experiment_path)
   return experiment_path
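
Note on the `get_log_dir` hunks above: every rank now computes the experiment path, only rank 0 performs the destructive or interactive step, and all ranks meet at a barrier before the directory is created. The sketch below imitates that ordering with threads and `threading.Barrier` from the standard library instead of `torch.distributed`, so it is an analogy and not the repository's code. `WORLD_SIZE`, `prepare_dir`, and `run_demo` are hypothetical names introduced for illustration.

```python
import os
import shutil
import tempfile
import threading

WORLD_SIZE = 4  # Hypothetical number of ranks, simulated here as threads.
barrier = threading.Barrier(WORLD_SIZE)


def prepare_dir(rank, path, overwrite):
  """Only rank 0 deletes; every rank waits, then creates the directory."""
  if overwrite and rank == 0 and os.path.exists(path):
    shutil.rmtree(path)
  try:
    # No rank proceeds until rank 0 has finished removing the old directory.
    barrier.wait(timeout=10)
  except threading.BrokenBarrierError:
    # Plays the role of the `except RuntimeError: sys.exit()` branch above.
    return
  os.makedirs(path, exist_ok=True)


def run_demo():
  """Returns (directory exists, stale file survived) after all ranks finish."""
  root = tempfile.mkdtemp()
  path = os.path.join(root, 'experiment')
  os.makedirs(path)
  stale = os.path.join(path, 'stale.txt')
  with open(stale, 'w') as f:
    f.write('old run')
  threads = [
      threading.Thread(target=prepare_dir, args=(rank, path, True))
      for rank in range(WORLD_SIZE)
  ]
  for t in threads:
    t.start()
  for t in threads:
    t.join()
  result = (os.path.isdir(path), os.path.exists(stale))
  shutil.rmtree(root)
  return result
```

Calling `run_demo()` returns `(True, False)`: the directory exists for every rank and the stale file from the previous run is gone. Without the barrier, a non-zero rank could recreate the directory before rank 0 removes it, which is presumably the race the added `dist.barrier()` call is meant to close.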

diff --git a/submission_runner.py b/submission_runner.py
@@ -316,10 +316,12 @@ def train_once(
     flag_file_name = os.path.join(log_dir, f'flags_{preemption_count}.json')
     logging.info(f'Saving flags to {flag_file_name}.')
     logger_utils.write_json(flag_file_name, flags.FLAGS.flag_values_dict())
-    metrics_logger = logger_utils.set_up_loggers(log_dir,
-                                                 flags.FLAGS,
-                                                 hyperparameters)
-    workload.attach_metrics_logger(metrics_logger)
+    metrics_logger = None
+    if RANK == 0:
+      metrics_logger = logger_utils.set_up_loggers(log_dir,
+                                                   flags.FLAGS,
+                                                   hyperparameters)
+      workload.attach_metrics_logger(metrics_logger)
 
   global_start_time = get_time()
   train_state['last_step_end_time'] = global_start_time
@@ -429,7 +431,7 @@ def train_once(
 
           logging_start_time = get_time()
 
-          if log_dir is not None:
+          if log_dir is not None and RANK == 0:
             metrics_logger.append_scalar_metrics(
                 latest_eval_result,
                 global_step=global_step,
@@ -467,7 +469,7 @@ def train_once(
 
   metrics = {'eval_results': eval_results, 'global_step': global_step}
 
-  if log_dir is not None:
+  if log_dir is not None and RANK == 0:
     metrics_logger.append_scalar_metrics(
         {'score': train_state['accumulated_submission_time']},
         global_step=global_step,
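
Note on the `submission_runner.py` hunks above: the metrics logger is now created on rank 0 only and is `None` elsewhere, so each logging call is guarded with `RANK == 0`. A minimal sketch of that guard follows. `ListLogger`, `make_logger`, and `log_metrics` are hypothetical stand-ins, not the repository's `logger_utils` API; only the method name `append_scalar_metrics` is taken from the diff.

```python
class ListLogger:
  """Hypothetical in-memory stand-in for the metrics logger."""

  def __init__(self):
    self.records = []

  def append_scalar_metrics(self, metrics, global_step):
    self.records.append((global_step, dict(metrics)))


def make_logger(rank, log_dir):
  """Creates a logger on rank 0 only; every other rank gets None."""
  if log_dir is not None and rank == 0:
    return ListLogger()
  return None


def log_metrics(logger, metrics, global_step):
  """Logs only when a logger exists, so non-zero ranks silently skip."""
  if logger is not None:
    logger.append_scalar_metrics(metrics, global_step=global_step)
```

Checking `logger is not None` in one helper is an alternative to repeating `log_dir is not None and RANK == 0` at each call site as the diff does. Both approaches prevent a non-zero rank from calling a method on `None`.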