oneapi build issue "hash sum mismatch" is affecting multiple PRs #20738
Comments
I created a PR to experiment with possible fixes to this problem. An edit to the docker "step 5" RUN line that installs oneapi forced the execution of that command, and the log shows a similar hash mismatch error: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/PR-20739/2/pipeline/39/ I conclude it's probably a problem with the public mirror of oneapi. Thoughts @TaoLv?
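For context, Docker re-executes a RUN instruction whenever its text changes, which is why a small edit to that line forces the install to run again instead of being replayed from the layer cache. A minimal sketch of such a cache-busting edit, with a hypothetical install command standing in for the real "step 5" contents:

```dockerfile
# Hypothetical illustration, not the actual MXNet "step 5" RUN line.
# Changing any character of a RUN instruction (here, the throwaway echo)
# invalidates Docker's layer cache, so the apt commands below execute again
# on the next build rather than being served from a cached layer.
RUN echo "cache-bust: force re-run of the oneAPI install" && \
    apt-get update && \
    apt-get install -y --no-install-recommends intel-oneapi-mkl
```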
Thank you for reporting the issue, @DickJC123. Actually, the apt source was added to install the MKL BLAS library, rather than oneDNN. Hi @yinghu5 @jingxu10, do you know who is managing the oneAPI apt repository? Thanks.
It's possible the problem has gone away, since I've been able to get some clean CI runs on a side debug PR I created (#20739). The only improvement from that work I would suggest is adding one line to the Dockerfile. If there are some mirrors serving up the wrong files, the symptom can show up any time one does an apt-get update. FYI @josephevans. If you add the line to a PR of yours, you will have to tweak the use of '&&' and '\' in the prior line.
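As a purely hypothetical stand-in for that line (an apt cache clean, in the spirit of the workaround mentioned in the next comment), the '&&' / '\' adjustment to the prior line would look roughly like this:

```dockerfile
# Hypothetical: the prior line originally ended the RUN instruction, so it
# gains a trailing '&& \' and the added line joins the same step. The install
# command shown is illustrative, not the actual Dockerfile contents.
RUN apt-get update && \
    apt-get install -y --no-install-recommends intel-oneapi-mkl && \
    apt-get clean
```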
I've seen the same behavior when testing the latest changes connected with the Ubuntu dockerfile with updated oneMKL. At the time, it was explained as 'Mirror sync in progress', and cleaning the cache with 'apt-get clean' was expected to help. I've requested a Jira ticket internally for that and can reopen it and add you (@TaoLv, @yinghu5, @jingxu10) to the task.
Description
Here are two independent PRs with the failure:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20635/38/pipeline
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20734/5/pipeline
The failure has been reported as an issue with the mirrors supplying oneapi: https://community.intel.com/t5/Registration-Download-Licensing/OneAPI-apt-repository-broken/m-p/1329104
I'm a little suspicious there might be more to it, based on two observations:
1. The image tag is the same as we've seen for a week or more, well before the apparent changes to the mirrors. So are we not handling cached docker images properly?
2. The failing apt-get update is performed by a later RUN command that is installing tensor-rt and cudnn. Perhaps the intel repo used to install onednn in the earlier RUN command should be removed from the container in that same step, since the installation is complete? It's possible that the command add-apt-repository -r "deb https://apt.repos.intel.com/oneapi all main" would perform that action (a sketch follows after this list). If the intel repo were no longer in /etc/apt/sources.list, presumably the currently failing apt-get update would succeed.
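For illustration only, a rough sketch of what that could look like in the CI Dockerfile; the package name and the surrounding setup are assumptions, and only the add-apt-repository -r call comes from the observation above:

```dockerfile
# Hypothetical sketch. Assumes software-properties-common and the Intel GPG
# key were installed in an earlier step; the package name is illustrative.
# The repo is added, used for the install, and then removed again so the
# apt-get update run by later RUN commands (e.g. the TensorRT/cuDNN step)
# no longer contacts the oneAPI mirrors at all.
RUN add-apt-repository "deb https://apt.repos.intel.com/oneapi all main" && \
    apt-get update && \
    apt-get install -y --no-install-recommends intel-oneapi-mkl && \
    add-apt-repository -r "deb https://apt.repos.intel.com/oneapi all main" && \
    rm -rf /var/lib/apt/lists/*
```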
Error Message
To Reproduce
Have not repro'd outside of CI runs.
Steps to reproduce
What have you tried to solve it?
I was not able to reproduce the failure using the recipe posted to the Intel site, i.e. it worked fine for me.
Environment