CM fails to build DLRMv2 99 #92
The DLRM docker container needs the Criteo dataset to be preprocessed outside of it. We need to add this option to the documentation page, but if you have the preprocessed data we can tell you how to use it. @anandhu-eng we can sync on how to add this option to the documentation page.
Huh, ok. I thought I saw it pulling down the full dataset, but I may have been mistaken. I'm working on a lot in parallel at the moment. :)
Currently we only support plugging in the preprocessed data, as the download of Criteo stopped working without manual intervention. I believe we can share the preprocessed data with you - doing the preprocessing yourself is heavy - it needs 6.4 TB of disk space, 600 GB+ of memory, and around 3 days of running. The preprocessed data is less than 300 GB. We can share it by the end of this week - it needs to be tested for the expected accuracy.
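Given the resource requirements mentioned above (6.4 TB of disk, 600 GB+ of memory), it can be worth checking free disk space up front before attempting the preprocessing. A minimal sketch using Python's standard `shutil` module - the function name and thresholds are illustrative, with the disk figure taken from the comment above (a memory check would need a third-party library such as psutil, so it is left out):

```python
import shutil

def has_enough_disk(path=".", need_disk_tb=6.4):
    """Check whether the filesystem containing `path` has enough free
    space for Criteo preprocessing (~6.4 TB per the thread above)."""
    free_bytes = shutil.disk_usage(path).free
    free_tb = free_bytes / 1e12
    ok = free_tb >= need_disk_tb
    print(f"free: {free_tb:.2f} TB, required: {need_disk_tb} TB -> "
          f"{'ok' if ok else 'insufficient'}")
    return ok
```

Running this before kicking off a multi-day preprocessing job avoids discovering a full disk partway through.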
Great, thank you.
Hi @WarrenSchultz, MLCommons has just made the preprocessed dataset for DLRMv2 available. It's about a 150 GB download. We no longer need TBs of disk space and days of waiting.
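For a download of this size (~150 GB), verifying file integrity before use can save a failed run later. A hedged sketch of a streaming SHA-256 over a large file using only Python's standard `hashlib` - the reference checksum itself would have to come from MLCommons and is not given in this thread:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a (possibly very large) file by streaming
    it in 1 MiB chunks, so memory use stays constant."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The resulting hex digest can then be compared against whatever checksum the dataset publisher provides.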
That's great news, thanks! I'll give it a try soon.
I am now trying to go through the steps in the documentation and have two issues to report related to this thread:

(1) The commands in https://docs.mlcommons.org/inference/benchmarks/recommendation/dlrm-v2/#__tabbed_5_2 have the option "--model=dlrm_v2-99", but this should be "--model=dlrm-v2-99" in order to work.

(2) After fixing this, the command proceeds through its several steps and reaches the downloading of the dataset (the preprocessed data that you provided). When the download of "day_23_sparse_multi_hot.npz" completes to 100%, it does not proceed. I killed the command, renamed the partially downloaded file to the expected name, and re-ran the command, but now I get a core dump. I am wondering whether something was left incomplete by the download, or whether the core dump is a separate issue. The error output is:
@pdrtrncs23 Thank you for reporting the issue with the documentation. This should be solved if you do … Please do …
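For the second symptom reported above (a core dump after renaming a partially downloaded file), one quick sanity check is whether the `.npz` archive is actually complete and readable before re-running the benchmark. A minimal sketch using NumPy's standard `np.load` - the filename comes from the thread; a download truncated mid-file typically fails here with a zip or value error:

```python
import numpy as np

def npz_is_loadable(path):
    """Return True if the .npz archive opens and every array in it can
    be read back; a truncated download usually raises an exception."""
    try:
        with np.load(path) as data:
            for name in data.files:
                _ = data[name]  # force each member to be decompressed
        return True
    except Exception:
        return False
```

If this returns False for a file such as day_23_sparse_multi_hot.npz, re-downloading it is a better next step than renaming the partial file.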
Closing this issue. Feel free to reopen if the issue persists.
I tried both running the command via a docker container and running it within the ResNet50 container.
The end of the log follows: