oneapi build issue "hash sum mismatch" is affecting multiple PRs #20738
Comments
I created a PR to experiment with possible fixes to this problem. An edit to the docker "step 5" RUN line that installs oneapi forced the execution of that command, and the log shows a similar hash mismatch error: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/PR-20739/2/pipeline/39/ I conclude it's probably a problem with the public mirror of oneapi. Thoughts @TaoLv?
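For context, Docker re-executes a RUN instruction whenever its text changes, which is why a small edit to that line forces the install to run again instead of being replayed from the layer cache. A minimal sketch of such a cache-busting edit, with a hypothetical install command standing in for the real "step 5" contents:

```dockerfile
# Hypothetical illustration, not the actual MXNet "step 5" RUN line.
# Changing any character of a RUN instruction (here, the throwaway echo)
# invalidates Docker's layer cache, so the apt commands below execute again
# on the next build rather than being served from a cached layer.
RUN echo "cache-bust: force re-run of the oneAPI install" && \
    apt-get update && \
    apt-get install -y --no-install-recommends intel-oneapi-mkl
```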
Thank you for reporting the issue, @DickJC123. Actually, the apt source was added to install the MKL BLAS library, rather than oneDNN. Hi @yinghu5 @jingxu10, do you know who is managing the oneAPI apt repository? Thanks.
It's possible the problem has gone away, since I've been able to get some clean CI runs on a side debug PR I created (#20739). The only improvement from that work I would suggest is adding one line to the Dockerfile. If there are some mirrors serving up the wrong files, the symptom can show up any time one does an apt-get update. FYI @josephevans. If you add the line to a PR of yours, you will have to tweak the use of '&&' and '\' in the prior line.
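As a purely hypothetical stand-in for that line (an apt cache clean, in the spirit of the workaround mentioned in the next comment), the '&&' / '\' adjustment to the prior line would look roughly like this:

```dockerfile
# Hypothetical: the prior line originally ended the RUN instruction, so it
# gains a trailing '&& \' and the added line joins the same step. The install
# command shown is illustrative, not the actual Dockerfile contents.
RUN apt-get update && \
    apt-get install -y --no-install-recommends intel-oneapi-mkl && \
    apt-get clean
```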
I've seen the same behavior when testing the latest changes connected with the Ubuntu dockerfile with updated oneMKL. At the time, it was explained as 'Mirror sync in progress', and cleaning the cache with 'apt-get clean' was expected to help. I've requested a Jira ticket internally for that and can reopen it and add you (@TaoLv, @yinghu5, @jingxu10) to the task.
Description
Here are two independent PRs with the failure:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20635/38/pipeline
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20734/5/pipeline
The failure has been reported as an issue with the mirrors supplying oneapi: https://community.intel.com/t5/Registration-Download-Licensing/OneAPI-apt-repository-broken/m-p/1329104
I'm a little suspicious there might be more to it, based on two observations:
1. The image tag is the same as we've seen for a week or more, well before the apparent changes to the mirrors. So are we not handling cached docker images properly?
2. The failing apt-get update is performed by a later RUN command that is installing tensor-rt and cudnn. Perhaps the intel repo used to install onednn in the earlier RUN command should be removed from the container in that same step, since the installation is complete? It's possible that the command add-apt-repository -r "deb https://apt.repos.intel.com/oneapi all main" would perform that action (a sketch follows after this list). If the intel repo were no longer in /etc/apt/sources.list, presumably the currently failing apt-get update would succeed.
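For illustration only, a rough sketch of what that could look like in the CI Dockerfile; the package name and the surrounding setup are assumptions, and only the add-apt-repository -r call comes from the observation above:

```dockerfile
# Hypothetical sketch. Assumes software-properties-common and the Intel GPG
# key were installed in an earlier step; the package name is illustrative.
# The repo is added, used for the install, and then removed again so the
# apt-get update run by later RUN commands (e.g. the TensorRT/cuDNN step)
# no longer contacts the oneAPI mirrors at all.
RUN add-apt-repository "deb https://apt.repos.intel.com/oneapi all main" && \
    apt-get update && \
    apt-get install -y --no-install-recommends intel-oneapi-mkl && \
    add-apt-repository -r "deb https://apt.repos.intel.com/oneapi all main" && \
    rm -rf /var/lib/apt/lists/*
```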
Error Message
To Reproduce
Have not repro'd outside of CI runs.
Steps to reproduce
What have you tried to solve it?
I was not able to reproduce the failure using the recipe posted to the Intel site, i.e. it worked fine for me.
Environment