
Fix primary metric for training #7

Open
atheurer opened this issue Aug 5, 2024 · 0 comments
atheurer commented Aug 5, 2024

Currently, the training metric is taken from training_params_and_metrics_global0.jsonl, extracting each value of "overall_throughput" from samples like:

```json
{
  "epoch": 0,
  "step": 279,
  "rank": 0,
  "loss": 0.010574680753052235,
  "overall_throughput": 9.970008352307522,
  "lr": 1.5584415584415585e-07,
  "cuda_mem_allocated": 13.605413436889648,
  "cuda_malloc_retries": 0,
  "num_loss_counted_tokens": 6734,
  "batch_size": 32,
  "total_loss": 0.49789267778396606,
  "gradnorm": 1.8222006559371948,
  "weight_norm": 456.3302917480469,
  "timestamp": "2024-08-02T20:02:47.758635"
}
```
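For reference, the current extraction amounts to something like the following minimal sketch (the file name is from this issue; the helper name is hypothetical):

```python
import json

def extract_throughput(path):
    """Collect the per-record throughput values from a JSONL file such as
    training_params_and_metrics_global0.jsonl (one JSON object per line)."""
    values = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "overall_throughput" in record:
                values.append(record["overall_throughput"])
    return values
```

As the issue notes, these values come from rank 0 only, so averaging them does not give the overall samples per second across GPUs.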

This value, however, is not representative of the overall samples per second; it is apparently the samples per second for GPU 0 only ("rank": 0 in the sample JSON). It also tracks only loosely the metric we actually want, "CurrSamplesPerSec", which is not found in this file but in the stdout of the training process.

The stdout of the training process is filtered by the ilab-client script and written to ilab-client-stderrout.txt. That file needs to be post-processed so that every occurrence of this metric is logged.
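The post-processing step could be sketched as below. The exact log-line format is not shown in this issue; this assumes a DeepSpeed-style `CurrSamplesPerSec=<float>` token appears somewhere on the line, and the function name is hypothetical:

```python
import re

# Assumed token format; adjust the pattern if the actual log lines differ.
CURR_RE = re.compile(r"CurrSamplesPerSec=([0-9]*\.?[0-9]+)")

def collect_curr_samples_per_sec(lines):
    """Return every CurrSamplesPerSec value found in an iterable of log lines."""
    return [float(m.group(1)) for line in lines for m in CURR_RE.finditer(line)]
```

Usage would be something like `collect_curr_samples_per_sec(open("ilab-client-stderrout.txt"))`, with the resulting values logged as the primary metric.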
