
Fix primary metric for training #7

Open
atheurer opened this issue Aug 5, 2024 · 0 comments
atheurer commented Aug 5, 2024

Currently, the training metric is taken from training_params_and_metrics_global0.jsonl, extracting each value of "overall_throughput" from samples like:

```json
{
  "epoch": 0,
  "step": 279,
  "rank": 0,
  "loss": 0.010574680753052235,
  "overall_throughput": 9.970008352307522,
  "lr": 1.5584415584415585e-07,
  "cuda_mem_allocated": 13.605413436889648,
  "cuda_malloc_retries": 0,
  "num_loss_counted_tokens": 6734,
  "batch_size": 32,
  "total_loss": 0.49789267778396606,
  "gradnorm": 1.8222006559371948,
  "weight_norm": 456.3302917480469,
  "timestamp": "2024-08-02T20:02:47.758635"
}
```
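For reference, the current extraction amounts to something like the following minimal sketch (the file name is from this issue; the helper name is hypothetical):

```python
import json

def extract_throughput(path):
    """Collect the per-record throughput values from a JSONL file such as
    training_params_and_metrics_global0.jsonl (one JSON object per line)."""
    values = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "overall_throughput" in record:
                values.append(record["overall_throughput"])
    return values
```

As the issue notes, these values come from rank 0 only, so averaging them does not give the overall samples per second across GPUs.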

This value, however, is not representative of the overall samples per second; it is apparently the samples per second for GPU 0 only ("rank": 0 in the sample JSON). It also tracks only loosely the metric we actually want, "CurrSamplesPerSec", which is not found in this file but in the stdout of the training process.

The stdout of the training process is filtered by the ilab-client script and written to ilab-client-stderrout.txt. That file needs to be post-processed so that every occurrence of this metric is logged.
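The post-processing step could be sketched as below. The exact log-line format is not shown in this issue; this assumes a DeepSpeed-style `CurrSamplesPerSec=<float>` token appears somewhere on the line, and the function name is hypothetical:

```python
import re

# Assumed token format; adjust the pattern if the actual log lines differ.
CURR_RE = re.compile(r"CurrSamplesPerSec=([0-9]*\.?[0-9]+)")

def collect_curr_samples_per_sec(lines):
    """Return every CurrSamplesPerSec value found in an iterable of log lines."""
    return [float(m.group(1)) for line in lines for m in CURR_RE.finditer(line)]
```

Usage would be something like `collect_curr_samples_per_sec(open("ilab-client-stderrout.txt"))`, with the resulting values logged as the primary metric.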
