Intel-optimized TF 1.9, AVX support #5142

Closed
wants to merge 2 commits into from


EdwardDixon
Contributor

The Intel-optimized version of TensorFlow 1.9 is now the default for Anaconda users. It now supports all processors with AVX, so everything since Sandy Bridge, which was released in 2011. With that in mind, I was thinking we could dispense with two different conda environments and fold everything into the gatk environment. @samuelklee, I'm the new guy on the Intel team you've been dealing with.

@cmnbroad cmnbroad self-assigned this Aug 29, 2018
@cmnbroad
Collaborator

cmnbroad commented Aug 29, 2018

@EdwardDixon Thanks for trying this - it would be great if we were able to have a single conda env, but a couple of questions:

  • We'd need to understand the effect of this change on our build times. It looks like the Travis builds are failing because the dependency downloads produce so many progress messages that we exceed the allowable log length, probably because the download is either large or slow. I'm not sure if that's transient or not.
  • We try to carefully control the size of our (already sizable) docker image. We'll need to understand how this impacts that.
  • @lucidtronix Any thoughts on moving from TensorFlow 1.4 to 1.9?

@droazen droazen requested a review from cmnbroad August 29, 2018 15:12
@droazen
Contributor

droazen commented Aug 29, 2018

@EdwardDixon What is the behavior if AVX is not available? Does it crash/refuse to run, or is there a graceful automatic fallback to non-vectorized code? Without such a fallback in place, I'm not sure we could switch to it exclusively.

@EdwardDixon
Contributor Author

No AVX = it'll crash... sounds bad, I admit. But remember we are talking about running a deep neural network over a large dataset: is someone really going to want to do that on hardware that is 8 years old (pre-AVX)? This version of TensorFlow is now the default for all Anaconda users, which in practice probably means a sizeable fraction of the machine learning community, and so having minimum hardware requirements in line with theirs is perhaps not so unreasonable?

Another option would be to change the default: have the gatk environment use the accelerated TensorFlow (since almost everyone has AVX, and they can get a 10X or so speedup), but make a second environment available for people who want to try to run a deep neural network on very old hardware - gatk-old?

@droazen
Contributor

droazen commented Aug 29, 2018

@EdwardDixon Well, you'd be surprised at some of the hardware we have to deal with. Even some machines here at the Broad don't have AVX. In general, our policy with hardware-dependent optimizations in GATK has been to insist on having a transparent fallback mechanism when the required hardware isn't present -- I'd really prefer not to start making exceptions to that rule. Could the Intel-optimized TensorFlow be patched to fall back to vanilla TensorFlow when AVX is not present? Is that an option? Or could it at least be patched to not actually crash in that case?
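
As a rough illustration of the kind of transparent fallback described above, a launcher could inspect the CPU feature flags before deciding which TensorFlow build to load. The sketch below is Linux-only and assumes the flags come from `/proc/cpuinfo`; the function names and the `"intel-optimized"` / `"vanilla"` labels are hypothetical and are not part of GATK or TensorFlow.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 Linux kernels list features on lines whose key is "flags".
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def choose_tensorflow_build(cpuinfo_text):
    """Pick a (hypothetical) build label: optimized only when AVX is reported."""
    # Exact token match, so "avx2" or "avx512f" alone does not count as "avx".
    return "intel-optimized" if "avx" in cpu_flags(cpuinfo_text) else "vanilla"


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the Linux CPU description; return empty text if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""
```

Calling `choose_tensorflow_build(read_cpuinfo())` returns `"vanilla"` whenever the flags cannot be read, so under these assumptions the slower path is the default and the optimized one is opt-in by detection.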

@EdwardDixon
Contributor Author

This sounds like a good rule, in general. In this case though, if users are going to run deep neural networks, there is going to be a substantial computational burden, such that running them is unlikely to appeal to users with older hardware (about 95% of users who train these models use accelerators, for example - one reason Deep Learning didn't really take off till 2014). If you could see your way to making AVX (i.e. 8-year-old hardware) the minimum requirement for your default docker image, you would be giving a 10X speedup to almost every user.

@droazen
Contributor

droazen commented Aug 30, 2018

@EdwardDixon We have a fair number of GATK users who are stuck with older hardware (including university clusters that they have no power to upgrade), and we can't just cut these users off by imposing such a minimum hardware requirement. The best we can do is to use AVX when it's available, and fall back to slower codepaths when it's not.

Also, actual crashes in native code impose a significant support burden on our comms team, as they are often hard to diagnose and deal with. Things like SIGSEGV or SIGILL are a nightmare for our support staff. At a minimum we'd need a graceful failure with an easy-to-understand error message when AVX is not present rather than a crash, before we could make this the default in GATK.
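
One possible way to turn a native crash into a readable message is to try the import in a child process first, so that a signal such as SIGILL kills only the child and the parent can report what happened. The following is a minimal sketch under that assumption; `probe_import` and `probe_command` are hypothetical helper names, not existing GATK or TensorFlow functions, and the negative-return-code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def probe_command(command, label):
    """Run a command in a child process and describe how it ended.

    Returns (ok, message). A crash in native code terminates only the child.
    """
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    code = result.returncode
    if code == 0:
        return True, "{} loaded successfully.".format(label)
    if code < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "signal {}".format(-code)
        hint = ""
        if -code == signal.SIGILL:
            hint = (" This often means the CPU lacks an instruction set"
                    " (such as AVX) that this build requires.")
        return False, "{} crashed with {}.{}".format(label, name, hint)
    return False, "{} failed to load (exit code {}).".format(label, code)


def probe_import(module_name, python=sys.executable):
    """Check whether a module can be imported without crashing the caller."""
    return probe_command([python, "-c", "import " + module_name], module_name)
```

A caller could run `probe_import("tensorflow")` before the real import and print the returned message instead of letting the main process die. The cost is one extra interpreter start-up, which is a trade-off to weigh against the support burden described above.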

@ldgauthier
Contributor

Aside from the users with old hardware, very few of the GCS zones guarantee processors that support AVX, which would lead to sporadic failures except in central-1f, for example.

@droazen droazen self-requested a review August 30, 2018 18:12
@droazen droazen self-assigned this Aug 31, 2018
@droazen droazen mentioned this pull request Oct 9, 2018
@droazen
Contributor

droazen commented Oct 15, 2018

Closing this in favor of #5291

@droazen droazen closed this Oct 15, 2018
4 participants