Failed to fetch latest snapshot for tensorflow-core-api linux gpu mkl #47
Comments
I think this may be related to the CI build failure: https://github.com/tensorflow/java/actions |
Yes, as per these threads mentioned at the last meeting, the builds on Linux are no longer reliable:
That is in addition to this issue preventing builds for Windows from succeeding: If you could intervene on behalf of Amazon via your GitHub support channel, please do. |
Yes, it is intended: we don't distribute them in our artifacts. You need to have them installed on your machine, or add dependencies on other JavaCPP artifacts to get them via Maven/Gradle, like this one: https://github.com/bytedeco/javacpp-presets/tree/master/mkl-dnn. (@saudet correct me if I'm wrong) |
Looks like Linux builds are back up: |
It looks like we have a solution for Windows as well: |
The CPU builds for Windows will also now work, but not for CUDA, yet (see pull #54). |
Ok, builds for Linux and Mac are back up: |
Still, it is kind of refreshing to see all these green checks on that page, thanks @saudet! |
@saudet , I've rebased the TF2.2.0 branch on your latest changes and it did not go that well: https://github.com/tensorflow/java/runs/674534254?check_suite_focus=true From what I understand, the Mac build got interrupted because the Mac + MKL build failed first; it seems something has changed in TF2.2.0 that I need to look at. Same thing for the Windows build, which got interrupted because the Windows + MKL build also failed first (that one was expected though). So I'm wondering if we can prevent the builds from being interrupted when other builds for the same platform fail, at least for the "vanilla" ones, so we can deliver them since they are more stable than their MKL/GPU flavours? |
...at the same time, only Windows MKL/GPU artifacts are expected to fail now, so it's fine for other platforms to stop if something goes wrong... Then maybe we just need to remove back the |
It looks like that's just the way GitHub Actions works. I haven't found a
way to disable that "feature".
|
The Windows builds do not fail, they just time out. Removing an "if failed"
clause that never gets executed won't change anything.
|
To work around this, make your PRs smaller. Making 1000 different changes
in the same PR makes it difficult to debug anyway.
|
I’m not sure I understand that last part: what does the size of a PR have to do with the success of a build? Btw, the PR I’m referring to looks larger than it is because we persist the generated files; there are actually just a few changes. Going back to this issue, I suggest that we disable the MKL/GPU builds for Windows until we know we can build them successfully. Then we can really rely on the CI build check status, which we have unfortunately got used to ignoring. I can make that change in my current PR. |
Or did we ever try to increase that timeout value? https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes If not, I’ll give it a try. |
I mean like with pull #56. Do one "small" change (upgrade TensorFlow and nothing else), see if that works, merge it if it works, then go on with your other changes in another PR.
I didn't notice that, no. Must be new. Let's see if it allows us to run builds for longer than 6 hours. GitHub Actions seems to have taken it without error in pull #56... |
And it looks like we are not the only ones having trouble building TF2.2.0+MKL on Mac; these guys reported the same error that we hit in our CI build:
Looks like we need to install more stuff on the Mac machines? |
So the missing file in MacOSX seems to be fixed now, I But I'm a bit concerned about this comment I found in TensorFlow configuration file:
So it sounds like we cannot assume that the MKL version coming with TF will work on those platforms for all TF releases and we need to install our own? Here is another related thread. /cc @saudet |
The OpenMP support on macOS works fine if you have libomp installed via brew, and you add its location to the header and linker paths (usually |
Oh sorry, I misread you, so you said that |
I patched the CMake build locally so it added things to the include & library paths if running on macOS, after running |
But what concerns me here is that the error I'm having is not with missing
Both platforms have the same error. So just to confirm, @Craigacp , are you telling me that playing with include paths may fix this error as well, or were you referring to the previous problem with |
Ah yeah, I meant header and linker issues, I didn't see that error. |
Yeah, sorry, my previous message was not very explicit (it's still the morning and I have not gone through my first coffee yet...). So if you have any tips for that second problem, please let me know! |
An update on this: I've added the I was planning to leave it that way if everything goes fine and then we can check how to enable MKL-DNN 1.x on all platforms or only on Linux. |
Ok, it's unfortunate, but it looks like we cannot increase the timeout of a job on GitHub Actions hosted runners beyond 6 hours, as dictated here. Setting the So this build (based on TF2.2) is the best we have got so far on this CI solution. So I'll go ahead with merging PR #44, and we will probably need to continue the discussion to find an alternative or a complement to it. |
Hi @roywei , we have slightly changed the way we redeploy all the artifacts after building, to normalize the timestamp associated with the latest snapshots. Please let us know if you notice any issue with fetching the artifacts again from Gradle (note also that the MKL and GPU builds for Windows have been temporarily removed). If everything is fine, let us know as well so we can safely close this issue. Thanks! |
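For anyone who wants to check the snapshot timestamps themselves, here is a minimal sketch in Python. It rests on an assumption not confirmed in this thread: that the resolver builds every jar file name from the top-level `<snapshot>` timestamp in `maven-metadata.xml`, so a classifier deployed at a different time would give a 404. The sample metadata, its timestamps, and the `stale_classifiers` helper are all made up for illustration; only the standard metadata layout is relied on.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the standard maven-metadata.xml layout; the timestamps
# and build numbers are illustrative, not copied from the real repository.
SAMPLE_METADATA = """\
<metadata modelVersion="1.1.0">
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-core-api</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20200428.120000</timestamp>
      <buildNumber>2</buildNumber>
    </snapshot>
    <snapshotVersions>
      <snapshotVersion>
        <classifier>linux-x86_64</classifier>
        <extension>jar</extension>
        <value>0.1.0-20200428.120000-2</value>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>windows-x86_64</classifier>
        <extension>jar</extension>
        <value>0.1.0-20200427.090000-1</value>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
"""


def stale_classifiers(metadata_xml):
    """Return classifiers whose jar is not at the latest snapshot timestamp."""
    root = ET.fromstring(metadata_xml)
    base = root.findtext("version").replace("-SNAPSHOT", "")
    timestamp = root.findtext("versioning/snapshot/timestamp")
    build = root.findtext("versioning/snapshot/buildNumber")
    # File-name version that a timestamp-based resolver would look for.
    latest = f"{base}-{timestamp}-{build}"
    stale = []
    for entry in root.iterfind("versioning/snapshotVersions/snapshotVersion"):
        if entry.findtext("extension") != "jar":
            continue
        if entry.findtext("value") != latest:
            stale.append(entry.findtext("classifier") or "(main jar)")
    return sorted(stale)
```

Calling `stale_classifiers(SAMPLE_METADATA)` on the sample flags the one classifier whose jar was deployed with an older timestamp. Against the real repository you would pass in the downloaded metadata text instead.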
Ok, I think we can close this one now. |
Hi, DJL's TensorFlow engine depends on the tensorflow-core-api SNAPSHOT package. Our dependencies are here: https://github.com/awslabs/djl/blob/master/tensorflow/tensorflow-native-auto/build.gradle#L19
We found that the tensorflow-core-api SNAPSHOT was updated on 04/28, but the corresponding linux-gpu-mkl.jar is missing; same for Windows.
Did the upload fail?
https://oss.sonatype.org/#nexus-search;quick~tensorflow-core-api
We get a 404 when trying to download the jar; both the Gradle build and manually fetching the following link fail.
https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&g=org.tensorflow&a=tensorflow-core-api&v=0.1.0-SNAPSHOT&e=jar&c=windows-x86_64
The mac-os-mkl.jar is there, but the libjnimkldnn.dylib and libiomp5.dylib extra libraries are missing. Is this intended? How can I find them? We rely on this task to download native dependencies automatically for users based on their platform. Please help take a look, thank you so much!
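To reproduce the 404 outside of Gradle, the check can be scripted. The following Python sketch rebuilds the redirect link quoted above from its query parameters and reports the HTTP status for a given classifier. The helper names `snapshot_url` and `http_status` are hypothetical, and it is an assumption that the Sonatype service answers HEAD requests the same way it answers a browser download.

```python
from urllib.parse import urlencode
import urllib.error
import urllib.request

# Base of the redirect service, taken from the link in this report.
REDIRECT_BASE = "https://oss.sonatype.org/service/local/artifact/maven/redirect"


def snapshot_url(classifier, group="org.tensorflow",
                 artifact="tensorflow-core-api", version="0.1.0-SNAPSHOT",
                 extension="jar", repository="snapshots"):
    """Build the redirect URL for one platform classifier."""
    query = urlencode([
        ("r", repository),
        ("g", group),
        ("a", artifact),
        ("v", version),
        ("e", extension),
        ("c", classifier),
    ])
    return f"{REDIRECT_BASE}?{query}"


def http_status(url, timeout=30):
    """Return the HTTP status of a HEAD request (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 404 and other error statuses arrive as exceptions in urllib.
        return error.code
```

For example, `http_status(snapshot_url("windows-x86_64"))` checks the same link as above. Other classifier strings have to be taken from the repository listing, since the exact names of the MKL and GPU classifiers are not spelled out in this report.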