
Failed to fetch latest snapshot for tensorflow-core-api linux gpu mkl #47

Closed
roywei opened this issue Apr 30, 2020 · 29 comments

@roywei
Contributor

roywei commented Apr 30, 2020

Hi, DJL's TensorFlow engine depends on the tensorflow-core-api SNAPSHOT package. Our dependencies are here: https://github.com/awslabs/djl/blob/master/tensorflow/tensorflow-native-auto/build.gradle#L19
We found that the tensorflow-core-api SNAPSHOT was updated on 04/28, but the corresponding linux-gpu-mkl.jar is missing; the same goes for Windows.

Did the upload fail?
https://oss.sonatype.org/#nexus-search;quick~tensorflow-core-api

We get a 404 when trying to download the jar; both the Gradle build and manually opening the following link fail:
https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&g=org.tensorflow&a=tensorflow-core-api&v=0.1.0-SNAPSHOT&e=jar&c=windows-x86_64
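
For reference, this is a quick way to reproduce the 404 outside of Gradle, reusing the same redirect URL as above (just a sketch; curl follows the redirect and reports the HTTP status of each hop):

# Check the snapshot redirect for the Windows artifact; a missing jar shows up
# as an HTTP 404 on the final redirect target.
curl -sIL "https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&g=org.tensorflow&a=tensorflow-core-api&v=0.1.0-SNAPSHOT&e=jar&c=windows-x86_64" | grep "HTTP/"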

The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing. Is this intended? How can I find them? We rely on this task to download native dependencies automatically for users based on their platform.
Please take a look, thank you so much!


@frankfliu
Contributor

I think it may be related to the CI build failures: https://github.com/tensorflow/java/actions
The snapshot was only partially uploaded.

@saudet
Contributor

saudet commented Apr 30, 2020

Yes, as per these threads mentioned at the last meeting, the builds on Linux are no longer reliable:

That is in addition to this issue preventing builds for Windows from succeeding:

If you could intervene on behalf of Amazon via your GitHub support channel, please do.

@karllessard
Collaborator

karllessard commented Apr 30, 2020

The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing, is this intended

Yes, it is intended: we don't distribute them in our artifacts. You either need to have them installed on your machine or add dependencies on other JavaCPP artifacts so that Maven/Gradle pulls them in for you, like this one: https://github.com/bytedeco/javacpp-presets/tree/master/mkl-dnn. (@saudet, correct me if I'm wrong.)

@saudet
Contributor

saudet commented May 1, 2020

The idea behind having MKL-DNN in a separate package is that it's over 100 MB in size (across all platforms), and these binaries are used by other libraries with which they can be shared, for example MXNet: https://github.com/bytedeco/javacpp-presets/blob/master/mxnet/platform/pom.xml#L33-L37

The same applies to most popular native libraries that are themselves used by other native libraries, such as OpenBLAS, OpenCV, FFmpeg, Arrow, HDF5, LLVM, MKL, CUDA/cuDNN/NCCL/TensorRT, etc.:
https://github.com/bytedeco/javacpp-presets/tree/master/openblas
https://github.com/bytedeco/javacpp-presets/tree/master/opencv
https://github.com/bytedeco/javacpp-presets/tree/master/ffmpeg
https://github.com/bytedeco/javacpp-presets/tree/master/arrow
https://github.com/bytedeco/javacpp-presets/tree/master/hdf5
https://github.com/bytedeco/javacpp-presets/tree/master/llvm
https://github.com/bytedeco/javacpp-presets/tree/master/mkl
https://github.com/bytedeco/javacpp-presets/tree/master/cuda
https://github.com/bytedeco/javacpp-presets/tree/master/tensorrt

I talk about the need for a distribution that would allow us to share components like that in this post:
http://bytedeco.org/news/2019/01/11/importance-of-a-distribution/
This is going to be crucial for Java machine learning applications to succeed in the enterprise.

@saudet
Contributor

saudet commented May 13, 2020

Looks like Linux builds are back up:
actions/runner-images#709 (comment)
We're now getting 13 GB of free space after CUDA, etc:
https://github.com/tensorflow/java/runs/669404530

@saudet
Contributor

saudet commented May 13, 2020

It looks like we have a solution for Windows as well:
actions/runner-images#785 (comment)
Let me try that right away...

@saudet
Contributor

saudet commented May 13, 2020

The CPU builds for Windows will now work as well, but not the CUDA ones yet (see pull #54).

@saudet
Contributor

saudet commented May 14, 2020

Ok, builds for Linux and Mac are back up:
https://github.com/tensorflow/java/runs/672994641
Windows is still a WIP though...

@karllessard
Collaborator

Still, it is kind of refreshing to see all these green checks on that page, thanks @saudet!

@karllessard
Collaborator

karllessard commented May 14, 2020

@saudet, I've rebased the TF2.2.0 branch with your latest changes and it did not go that well: https://github.com/tensorflow/java/runs/674534254?check_suite_focus=true

From what I understand, the Mac build got interrupted because the Mac + MKL build failed first; it seems like something has changed in TF2.2.0 that I need to look at. Same thing for the Windows build, which got interrupted because the Windows + MKL build also failed first (that one was expected, though).

So I'm wondering if we can prevent builds from being interrupted when other builds for the same platform fail, at least for the "vanilla" ones, so we can still deliver them since they are more stable than their MKL/GPU flavours.

@karllessard
Collaborator

...at the same time, only the Windows MKL/GPU artifacts are expected to fail now, so it's fine for the other platforms to stop if something goes wrong...

Then maybe we just need to remove the if ERRORLEVEL 1 exit /b condition check again if there is no way to break the dependencies between the different builds for Windows (or we can simply disable the MKL/GPU builds for Windows until we fix them).

@saudet
Contributor

saudet commented May 14, 2020 via email

@saudet
Contributor

saudet commented May 14, 2020 via email

@saudet
Contributor

saudet commented May 14, 2020 via email

@karllessard
Collaborator

To work around this, make your PRs smaller. Making 1000 different changes
in the same PR makes it difficult to debug anyway.

I'm not sure I understand that last part: what does the size of a PR have to do with the success of a build? BTW, the PR I'm referring to looks larger than it is because we persist the generated files; there are actually just a few changes.

Going back to this issue, I suggest that we disable the MKL/GPU builds for Windows until we know we can build them successfully. Then we can really rely on the CI check status, which we have unfortunately gotten used to ignoring. I can make that change in my current PR.

@karllessard
Collaborator

Or did we ever try to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

If not, I'll give it a try.

@saudet
Contributor

saudet commented May 15, 2020

I'm not sure I understand that last part: what does the size of a PR have to do with the success of a build? BTW, the PR I'm referring to looks larger than it is because we persist the generated files; there are actually just a few changes.

I mean like with pull #56. Do one "small" change (upgrade TensorFlow and nothing else), see if it works, merge it if it does, then go on with your other changes in another PR.

Or did we ever try to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

If not, I'll give it a try.

I didn't notice that, no. Must be new. Let's see if it allows us to run builds for longer than 6 hours. GitHub Actions seems to have taken it without error in pull #56...

@karllessard
Collaborator

And it looks like we are not the only ones having trouble building TF2.2.0+MKL on Mac; these guys reported the same error that we hit in our CI build:

ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/mkl_dnn_v1/BUILD.bazel:52:1: C++ compilation of rule '@mkl_dnn_v1//:mkl_dnn' failed (Exit 1)
In file included from external/mkl_dnn_v1/src/cpu/rnn/ref_postgemm_lstm.cpp:21:
external/mkl_dnn_v1/src/common/dnnl_thread.hpp:42:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^~~~~~~

Looks like we need to install more stuff on the Mac machines?

@karllessard
Collaborator

So the missing file on macOS seems to be fixed now: I brew install libomp before building on the Mac machine. Still, the MKL builds fail for Mac and Windows.

But I'm a bit concerned about this comment I found in TensorFlow's configuration file:

# Please note that MKL on MacOS or windows is still not supported.
# If you would like to use a local MKL instead of downloading, please set the
# environment variable "TF_MKL_ROOT" every time before build.

So it sounds like we cannot assume that the MKL version shipped with TF will work on those platforms for every TF release, and that we need to install our own? Here is another related thread.
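
In case we go that route, setting it is just a matter of exporting the variable before running the build; the path below is only an illustration of where a local MKL might live, not something taken from our CI:

# Hypothetical example: point the TensorFlow build at a locally installed MKL
# (the install path is illustrative) instead of the one it downloads.
export TF_MKL_ROOT=/opt/intel/mkl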

/cc @saudet

@Craigacp
Collaborator

The OpenMP support on macOS works fine if you have libomp installed via brew and you add its location to the header and linker paths (usually /usr/local/include and /usr/local/lib). I had to jump through these hoops a couple of times for onnx-runtime.
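
For what it's worth, the workaround boils down to something like this (a sketch assuming Homebrew's default /usr/local prefix; adjust the paths if brew installs libomp elsewhere):

# Install the OpenMP runtime, then expose its headers and libraries to the
# compiler and linker; CPATH and LIBRARY_PATH are honored by clang and gcc.
brew install libomp
export CPATH=/usr/local/include:$CPATH
export LIBRARY_PATH=/usr/local/lib:$LIBRARY_PATH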

@karllessard
Collaborator

karllessard commented May 15, 2020

Thanks @Craigacp, any tip on how to install it easily from the command line?

Oh sorry, I misread you: you said that brew install libomp should have done it. I'll check the include paths or maybe try to set this TF_MKL_ROOT environment variable.

@Craigacp
Collaborator

I patched the CMake build locally so it added things to the include & library paths if running on macOS, after running brew install libomp. I think their CI was already configured in some way I didn't quite understand.

@karllessard
Collaborator

But what concerns me here is that the error I'm getting is no longer about the missing omp.h header (for that, just brew installing it did the trick). The error is now that the code does not compile on Mac and Windows, because of what seems to be an incompatibility on those platforms between the MKL library released with TF and some of the kernels making use of it:

ERROR: /private/var/tmp/_bazel_runner/17ed05ace81230cf3c69807725c012c8/external/org_tensorflow/tensorflow/core/kernels/BUILD:7897:1: C++ compilation of rule '@org_tensorflow//tensorflow/core/kernels:mkl_conv_op' failed (Exit 1)
In file included from external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.cc:19:
external/org_tensorflow/tensorflow/core/kernels/mkl_conv_ops.h:157:19: error: no viable overloaded '='
      *input_dims = mkldnn_sizes;
      ~~~~~~~~~~~ ^ ~~~~~~~~~~~~

Both platforms show the same error. So just to confirm, @Craigacp, are you telling me that playing with the include paths may fix this error as well, or were you referring to the previous problem with omp.h?

@Craigacp
Collaborator

Ah yeah, I meant header and linker issues, I didn't see that error.

@karllessard
Collaborator

Yeah, sorry, my previous message was not very explicit (it's still the morning and I have not gotten through my first coffee yet...). So if you have any tip for that second problem, please let me know!

@karllessard
Collaborator

An update on this: I've added the --define build_with_mkl_dnn_v1_only=false flag to the bazel command line and now macOS + MKL passes; it looks like MKL-DNN v1 only works out of the box on Linux.

I was planning to leave it that way if everything goes fine, and then we can check how to enable MKL-DNN 1.x on all platforms or only on Linux.
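
For context, the flag just gets appended to the Bazel invocation, along the lines of the following (the --config=mkl option and the build target here are only illustrative, not necessarily what our CI scripts pass):

# Example only: build TensorFlow with MKL enabled but without forcing the
# MKL-DNN v1-only code path.
bazel build --config=mkl --define build_with_mkl_dnn_v1_only=false //tensorflow:libtensorflow_cc.so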

@karllessard
Collaborator

Ok, it's unfortunate, but it looks like we cannot increase the timeout of a job on GitHub Actions hosted runners beyond 6 hours, as dictated here. Setting the timeout-minutes parameter to a higher value didn't help.

So this build (based on TF2.2) is the best we have gotten so far with this CI solution. I'll go ahead with merging PR #44, and we will probably need to continue the discussion to find an alternative or a complement to it.

@karllessard
Collaborator

Hi @roywei, we have slightly changed the way we redeploy all the artifacts after building, in order to normalize the timestamp associated with the latest snapshots.

Please let us know if you notice any issue fetching the artifacts from Gradle again (note also that the MKL and GPU builds for Windows have been temporarily removed). If everything is fine, let us know as well so we can safely close this issue. Thanks!

@karllessard
Collaborator

Ok, I think we can close this one now.
