This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

oneapi build issue "hash sum mismatch" is affecting multiple PR's #20738

Open
DickJC123 opened this issue Nov 12, 2021 · 4 comments

Comments

@DickJC123
Contributor

Description

Here are two independent PRs with the failure:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20635/38/pipeline
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20734/5/pipeline

The failure has been reported as an issue with the mirrors supplying oneapi: https://community.intel.com/t5/Registration-Download-Licensing/OneAPI-apt-repository-broken/m-p/1329104

I'm a little suspicious there might be more to it, based on two observations:

  1. The onednn lib is installed by a RUN command in Dockerfile.build.ubuntu. This creates an intermediate docker image that is pulled in from cache in the failing builds:
[2021-11-11T23:00:22.939Z] Step 5/20 : RUN export DEBIAN_FRONTEND=noninteractive ...
[2021-11-11T23:00:23.196Z]  ---> Using cache
[2021-11-11T23:00:23.196Z]  ---> 1a09ef0af63e

The image ID is the same one we've seen for a week or more, well before the apparent changes to the mirrors. So are we not handling cached docker images properly?

  2. The actual error is in an apt-get update performed by a later RUN command that installs TensorRT and cuDNN. Perhaps the intel repo used to install onednn in the earlier RUN command should be removed from the container in that same step, since the installation is complete by then? It's possible that the command add-apt-repository -r "deb https://apt.repos.intel.com/oneapi all main" would perform that action. If the intel repo were no longer in /etc/apt/sources.list, presumably the currently failing apt-get update would succeed.
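
A minimal sketch of the removal proposed in the second observation, assuming the Intel entry is a plain `deb` line in a sources file. The function name `remove_apt_repo` is hypothetical, and it filters the file directly instead of calling `add-apt-repository -r`, so it can be tried on a copy of the sources file first.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MXNet CI scripts):
# drop every line that mentions a given repo URL from an apt sources file.
# Usage: remove_apt_repo <sources-file> <repo-url>
remove_apt_repo() {
    sources_file=$1
    repo_url=$2
    tmp_file=$(mktemp) || return 1
    status=0
    # -v keeps the non-matching lines; -F treats the URL as a literal string.
    grep -vF -- "$repo_url" "$sources_file" > "$tmp_file" || status=$?
    # grep exits with 1 when no line is left to print, which is fine here;
    # only exit codes above 1 indicate a real error.
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp_file"
        return "$status"
    fi
    # Overwrite in place so the original file keeps its permissions.
    cat "$tmp_file" > "$sources_file"
    rm -f "$tmp_file"
}
```

Called as `remove_apt_repo /etc/apt/sources.list "https://apt.repos.intel.com/oneapi"` at the end of the RUN step that installs the Intel packages, this would leave later apt-get update calls with no Intel entry to fetch, provided the entry really lives in that file.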

Error Message

[2021-11-11T23:00:39.105Z] Err:9 https://apt.repos.intel.com/oneapi all/main all Packages
[2021-11-11T23:00:39.105Z]   Hash Sum mismatch
[2021-11-11T23:00:39.105Z]   Hashes of expected file:
[2021-11-11T23:00:39.105Z]    - Filesize:21072 [weak]
[2021-11-11T23:00:39.105Z]    - SHA512:7082767f95f6e40ad31deb8a9df205fa726ef3f4821ff6982d507f2f91adb57c282d1fbe3253f610b3e07f77a0c3c2320ed2c78b8d4b5b648928dd5c1fea271e
[2021-11-11T23:00:39.105Z]    - SHA256:7e91d4ace2815407f999e88e5296f678447b9577e1f84af4addc7212c8eb32b0
[2021-11-11T23:00:39.105Z]    - SHA1:53e523680f4f09015f82673434772a6ec112e8f2 [weak]
[2021-11-11T23:00:39.105Z]    - MD5Sum:3f125fa13d509dd4e66fa49ae3d5af96 [weak]
[2021-11-11T23:00:39.105Z]   Hashes of received file:
[2021-11-11T23:00:39.105Z]    - SHA512:5af0e2266d2ef7cfd42b907c68d21b020e8e1f6c516e9fb35c7affcd52d047ffedec885f14685eaf6539edfc23c0da8e9c7035bcede483a331d9c66e5dce8c54
[2021-11-11T23:00:39.105Z]    - SHA256:97bb376982553d6f5ae07c29a79fd653295caf7599cd6deb3c051c90a0290af1
[2021-11-11T23:00:39.105Z]    - SHA1:9e1ac9d3f961d4e376cbc55758a334cc158a9603 [weak]
[2021-11-11T23:00:39.105Z]    - MD5Sum:db23233f3ef8572c745ff537a2b2fdb8 [weak]
[2021-11-11T23:00:39.105Z]    - Filesize:21072 [weak]
[2021-11-11T23:00:39.105Z]   Last modification reported: Tue, 05 Oct 2021 04:38:36 +0000
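
To see which variant of the index a given mirror is serving, the SHA256 of a manually downloaded Packages file can be compared with the expected digest from the log above. This is a rough sketch: the helper name `check_sha256` is hypothetical, and it assumes `sha256sum` from GNU coreutils is available.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's SHA256 with an expected hex digest.
# Usage: check_sha256 <file> <expected-sha256>
check_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH: expected $expected, got $actual"
        return 1
    fi
}
```

For the failure shown here, the expected value would be the `SHA256:7e91d4ac...` digest listed under "Hashes of expected file".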

To Reproduce

Have not repro'd outside of CI runs.

Steps to reproduce

What have you tried to solve it?

I was not able to reproduce the failure using the recipe posted on the Intel site; it worked fine for me.

Environment

@DickJC123
Contributor Author

I created a PR to experiment with possible fixes to this problem. An edit to the docker "step 5" RUN line that installs oneapi forced the execution of that command, and the log shows a similar hash mismatch error: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/PR-20739/2/pipeline/39/

I conclude it's probably a problem with the public mirror of oneapi. Thoughts @TaoLv?

@TaoLv
Member

TaoLv commented Nov 12, 2021

Thank you for reporting the issue, @DickJC123. Actually, the apt source was added to install the MKL BLAS library, rather than oneDNN.

Hi @yinghu5 @jingxu10, do you know who is managing the oneAPI apt repository? Thanks.

@DickJC123
Contributor Author

DickJC123 commented Nov 15, 2021

It's possible the problem has gone away, since I've been able to get some clean CI runs on a side debug-PR I created (#20739). The only improvement from that work I would suggest is the following line:

https://github.com/apache/incubator-mxnet/blob/705e3d87564a11308ec37c7d0ce07244e14f409c/ci/docker/Dockerfile.build.ubuntu#L103

If some mirrors are serving the wrong files, then every apt-get update that runs while the repo is in the apt repo list risks hitting a bad mirror server and getting the hash-mismatch error. I therefore recommend that the docker RUN command that adds the oneapi repo and installs some packages (seen as "step 5" in the log) also remove the repo from the apt repo list afterwards. That way, when another docker RUN does an apt-get update (e.g. "step 20" to install TensorRT and cuDNN), it won't needlessly reach out to the oneapi mirrors.
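
One way to picture the add, install, remove sequence is the sketch below. It is an assumption-laden illustration, not the line linked above: the helper `with_temp_apt_source` is hypothetical, and it keeps the repo in a dedicated list file rather than using add-apt-repository.

```shell
#!/bin/sh
# Hypothetical helper: expose an apt source only while one command runs.
# Usage: with_temp_apt_source <list-file> <deb-line> <command> [args...]
with_temp_apt_source() {
    list_file=$1
    deb_line=$2
    shift 2
    # Register the repo in its own list file so it is easy to remove again.
    printf '%s\n' "$deb_line" > "$list_file" || return 1
    status=0
    # Run the install step and remember its exit status without aborting.
    "$@" || status=$?
    # Remove the source so later apt-get update calls skip this repo.
    rm -f "$list_file"
    return "$status"
}
```

Inside a RUN step, the command passed in would be the apt-get update and install for the Intel packages, and the list file would be something like a file under /etc/apt/sources.list.d (the exact name is an assumption). The repo is removed even when the install fails, so a broken mirror cannot leak into later steps.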

FYI @josephevans. If you add the line to a PR of yours, you will have to tweak the use of '&&' and '\' in the prior line.

@akarbown
Contributor

> Thank you for reporting the issue, @DickJC123. Actually, the apt source was added to install the MKL BLAS library, rather than oneDNN.
>
> Hi @yinghu5 @jingxu10, do you know who is managing the oneAPI apt repository? Thanks.

I've seen the same behavior when testing the latest changes to the Ubuntu Dockerfile with the updated oneMKL. At the time, it was explained as 'Mirror sync in progress', and clearing the cache with 'apt-get clean' was supposed to help. I filed an internal Jira ticket for that and can reopen it and add you (@TaoLv, @yinghu5, @jingxu10) to the task.
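
If the cause really is a mirror sync in progress, a retry with a cache clean between attempts might ride it out. The following is only a sketch under that assumption; the helper name `retry_with_cleanup` is hypothetical, and whether 'apt-get clean' is enough depends on the state of the mirrors.

```shell
#!/bin/sh
# Hypothetical helper: retry a command, running a cleanup command between attempts.
# Usage: retry_with_cleanup <max-attempts> <cleanup-command-string> <command> [args...]
retry_with_cleanup() {
    max_attempts=$1
    cleanup=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt failed" >&2
        # Clear cached state before trying again; ignore cleanup failures.
        sh -c "$cleanup" || true
        attempt=$((attempt + 1))
    done
    return 1
}
```

A possible call would be `retry_with_cleanup 3 'apt-get clean' apt-get update`. This only masks a transient mirror problem; it does nothing if every mirror serves a stale index.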
