[HPC] Proposal: Exclude data movement from timing #507
Comments
My comments:
I believe it is essential to include data movement in this benchmark suite to distinguish it from the MLPerf Training benchmarks, since HPC applications typically involve large datasets that stress the I/O subsystem. In addition, this can be studied in detail by the storage group in the next year or so. How about including data movement in time-to-train (strong-scaling mode) and possibly excluding it from throughput (weak-scaling mode)? I agree that we should still report the data movement timing, which can be excluded from the final metric.
Differentiating from MLPerf-T should not be sufficient justification for policy development. It is true that this change will blur the line between MLPerf-T and MLPerf-HPC even more, but the truth is that the two benchmarks are indeed very similar. Including data movement (even partly, e.g., only for the closed division) does not alleviate the high cost of submission, which is the most common feedback we received (by a very wide margin). Please remember that the motivation for these proposals is to increase MLPerf-HPC's popularity and participation. Having an optional report of data movement timing still adds complexity, this time in parsing the results. Describing data movement optimization strategies in READMEs could be a good middle ground. The MLPerf-Storage approach might be the best solution, and it also results in cleanly separated scopes across the various MLCommons suites.
Hi @nvaprodromou, are potential submitters willing to guarantee that they would submit to a version without storage? I have been pondering this, and it is a difficult trade-off: fidelity vs. submission quantity. Is there a way we can de-risk it? How would we feel if we drop storage and data movement and then no additional submitters appear?
I don't think we can get any formal guarantees on this. I asked this question myself as well and can dig into it some more, but I doubt we'll get any commitments. Even if we do, these are still likely to be NVIDIA submissions, which only solves part of the problem (participation and competition both need to rise).

Furthermore, changing the rules by itself is not going to change things. We'll need to run some sort of campaign to advertise that (I'm making numbers up) submissions are now 100x easier than they used to be, results (i.e., return on investment) have a guaranteed lifespan, and (this is primarily for businesses) results are more useful to entities that seek to purchase an HPC system.

Even though no guarantees can be made, easier submissions, guaranteed returns, and a good campaign can't really hurt the existing participation numbers. On the other hand, if we change the rules and no additional submitters appear, I would argue we are in the same place we were before: even though the quality of the benchmark was reduced compared to v2.0, the primary problem remains attracting participation and competition. We can have a shiny thing few care about, or a less shiny thing few care about.
This was accepted and implemented in the rules, so I think it can be closed now. Correct, @nvaprodromou?
Introduction:
After collecting feedback from engineers, clients, and the press, NVIDIA presented a list of proposals that aim to improve the popularity of the MLPerf HPC benchmark suite. Please see our slide deck for more information on our feedback-gathering process and insights.
Proposal: Exclude data movement from timing (start clock after data retrieval, before caching. Same as MLPerf-T).
Slide 14 in proposals slide deck.
This proposal aims to improve the popularity of the MLPerf HPC benchmark suite by improving on the following aspects:
Note: We strongly believe that the filesystem is an extremely important part of the system, and we always advise potential clients to consider the interplay of all parts of a system (FS + compute + network). However, we received a strong signal from some clients that including data movement in the timing makes it harder to use MLPerf-HPC scores for apples-to-apples comparisons, as the FS and compute are sometimes not purchased at the same time.
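The timing rule proposed above (start the clock after data retrieval and staging, before any caching benefit, as in MLPerf Training) can be sketched as follows. This is an illustrative Python sketch only, not actual benchmark harness code; `stage_data` and `train` are hypothetical placeholders:

```python
import time


def stage_data(path):
    # Hypothetical placeholder for data movement: e.g., copying the
    # dataset from the parallel filesystem to node-local storage.
    time.sleep(0.05)  # simulate staging cost
    return [0.0] * 1000


def train(dataset):
    # Hypothetical placeholder for the training run.
    return sum(dataset)


def run_benchmark(data_path):
    # Data movement happens BEFORE the clock starts, per the proposal.
    stage_start = time.perf_counter()
    dataset = stage_data(data_path)
    staging_time = time.perf_counter() - stage_start

    # Timed region: begins after staging, before any caching benefit,
    # so the reported metric covers compute only.
    start = time.perf_counter()
    train(dataset)
    time_to_train = time.perf_counter() - start

    # Staging time can still be reported alongside the metric,
    # even though it is excluded from the final score.
    return {"time_to_train": time_to_train,
            "staging_time_reported": staging_time}
```

Reporting the staging time separately, as in the dictionary above, keeps the data-movement information available without folding it into the scored metric.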
Discussion
Pros:
Cons: