
AVX present? #5291

Merged: 7 commits merged into broadinstitute:master on Nov 19, 2018

Conversation

EdwardDixon
Contributor

CNNScoreVariant relies on a computationally demanding operation - a deep neural network. Using an Intel-optimized version of TensorFlow gives a 10X improvement in performance (e.g. 50 hours to 5 hours for a typical input). However, these improvements mean that we now have a minimum hardware requirement - the availability of AVX.

@EdwardDixon
Contributor Author

@lucidtronix Hi Sam, wondering if you could take a look at this? It adds the "AVX required" message.

@droazen
Contributor

droazen commented Oct 9, 2018

@EdwardDixon Isn't this a dupe of #5142? Have you addressed our original concerns from that PR's discussion thread, some of which I've reproduced below?

droazen commented on Aug 30:
@EdwardDixon We have a fair number of GATK users who are stuck with older hardware (including university clusters that they have no power to upgrade), and we can't just cut these users off by imposing such a minimum hardware requirement. The best we can do is to use AVX when it's available, and fall back to slower codepaths when it's not.

Also, actual crashes in native code impose a significant support burden on our comms team, as they are often hard to diagnose and deal with. Things like SIGSEGV or SIGILL are a nightmare for our support staff. At a minimum we'd need a graceful failure with an easy-to-understand error message when AVX is not present, rather than a crash, before we could make this the default in GATK.

ldgauthier commented on Aug 30:
Aside from the users with old hardware, very few of the GCS zones guarantee processors that support AVX, which would lead to sporadic failures everywhere except central-1f, for example.

@lucidtronix
Contributor

@EdwardDixon Thanks for this! I think the AVX check in CNNScoreVariants is good. As @droazen points out, we still want the split environments, though now, with the check in place, I think we can make Intel-optimized the default. Other thoughts, @droazen, @cmnbroad?

@droazen
Contributor

droazen commented Oct 9, 2018

@lucidtronix Agreed -- provided that there's a fallback conda environment that users without AVX can use, and provided that the tool produces an easy-to-understand error message when AVX is not present, pointing the user to that fallback environment.

@cmnbroad (Collaborator) left a review comment:

We need to retain the two alternative conda environments for the reasons mentioned in the last PR. It would be great to make Intel-TF the default, but to do so we need to find a way to handle the non-AVX case gracefully. See my inline comments. Also, the Travis build is still failing as in the last PR (it exceeds the maximum log-length output).

IntelGKLUtils utils = new IntelGKLUtils();
if (utils.isAvxSupported() == false)
{
return new String[]{CNNScoreVariants.AVXREQUIRED_ERROR};
Collaborator:

Per the conversation last time, we can't just completely fail to run when AVX isn't present; we need to allow the user to run using the alternative conda environment. Given that, if we were to make Intel-TF the default, what does the failure mode look like when AVX isn't present? IIRC, it was a pretty hard failure? If there were a deterministic exception raised in Python, it might be possible to detect that and handle it gracefully, and change this message to tell the user what to do.

Contributor:

Maybe the answer is for the conda environments to set an extra environment variable that would allow GATK to detect which conda environment it's in. Then you could have a check in CNNScoreVariants that aborts the tool only if AVX is not present AND you're running in the Intel conda environment, and point the user to the non-Intel conda environment in the error message.
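A minimal sketch of that idea, assuming the Intel conda environment's activation script exported a marker variable (the variable name GATK_CONDA_ENV and its value are illustrative, not something this PR adds):

```java
// Illustrative only: abort with a readable message when AVX is missing AND the
// Intel-optimized conda environment is active; otherwise let the tool proceed
// (e.g. on a plain-TensorFlow fallback environment).
final IntelGKLUtils utils = new IntelGKLUtils();
final String condaEnv = System.getenv("GATK_CONDA_ENV");          // hypothetical marker variable
final boolean inIntelEnvironment = "gatk-intel".equals(condaEnv); // hypothetical value
if (inIntelEnvironment && !utils.isAvxSupported()) {
    return new String[]{CNNScoreVariants.AVXREQUIRED_ERROR};
}
```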

@cmnbroad
Collaborator

cmnbroad commented Oct 9, 2018

BTW, if anyone can provide the output from an interactive Python session that tries (and fails) to use Intel-TF on non-AVX hardware, it might help us figure out how to detect that case.

@EdwardDixon
Contributor Author

EdwardDixon commented Oct 11, 2018 via email

@lucidtronix
Contributor

@EdwardDixon I did not know that! In that case master does already require AVX. If it only impacts this tool and we provide sufficient warning and instructions, I think the single intel-optimized conda environment will be so much easier to test and maintain. Users who don't have AVX can simply install an older tensorflow in their environment, but GATK doesn't need to worry about it.

@cmnbroad
Collaborator

Hm. Just yesterday we updated from TF 1.4 to 1.9. Although this makes it more compelling to switch the default to Intel-optimized, we may still have an issue for the reasons outlined in the previous PR (academic users, not all GCS zones guarantee AVX hardware, and it's still unclear to me whether Travis, which uses both GCS and EC2, makes such a guarantee). It sounds like the failure mode is to crash. For running inference at least (training may be a different story), we may need something better. Another option: it sounds like it's possible to build our own distribution without AVX dependencies to use as a fallback.

@EdwardDixon
Contributor Author

Hi @cmnbroad, we can do better than crash. You'll notice that in my pull request I've added a check to CNNScoreVariants.java to test whether AVX is present (the method was checking whether command-line arguments are valid, so I stretched a point and used it to test the environment).
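For reference, the shape of that check as it sits in customCommandLineValidation(), a sketch pieced together from the diff excerpts shown in this PR rather than the verbatim file:

```java
@Override
protected String[] customCommandLineValidation() {
    // ... the tool's existing argument checks, e.g. the tensor-type/architecture
    // validation visible in the diff above ...

    // AVX check added by this PR: return a readable error message instead of
    // letting the Intel-optimized TensorFlow build die in native code later on.
    final IntelGKLUtils utils = new IntelGKLUtils();
    if (!utils.isAvxSupported()) {
        return new String[]{CNNScoreVariants.AVXREQUIRED_ERROR};
    }
    return null; // null means "no validation errors"
}
```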


You could definitely offer your own build of TensorFlow, but do you want to? Apart from anything else, it'll be 10X slower than the default. My understanding is that inference using CNNScoreVariants was taking ~ 50 hours on a typical real-world file before optimization. Wouldn't it be more cost-effective just to use instances that at least have AVX?

@rpomaris

Marissa Powers here -- I'm an Intel engineer on the same team as Ed. It sounds like we all agree on having Intel-optimized TF as the default and figuring out the best intervention for older machines from there. We can add the AVX flag within CNNScoreVariants (and any other AI tool). From there, we can (1) provide a detailed error output describing the issue, (2) provide a non-AVX TF build, and (3) automatically roll back TF to the provided version. @EdwardDixon, it sounds like @cmnbroad is suggesting options (1), (2), and (3), while you're suggesting just (1).

Sound right?

@cmnbroad
Collaborator

cmnbroad commented Oct 12, 2018

Hi Marissa - I think we're all in agreement that we'd like to find a way to make Intel-TF the default, but whether or not we can have CNNScoreVariants require AVX to run is less clear. Naturally, we'd prefer not to have to provide a custom TF distribution as a fallback, but there are 3 cases where we may not have a choice: users with old hardware, Travis/CI testing, and GCE. We may need to provide a fallback environment for those (I'll try to get resolution on that). If it turns out we do, I'm actually not suggesting the fallback be automatic (3 in your list), just that we have a graceful failure mode and an instructive error message.

In the meantime, there is still the issue that this PR fails to even build on Travis. It looks like it produces so much output building the Docker image that it exceeds the allowable Travis build log size. That will need to be resolved, and we'll also need to understand the impact of this change on the size of our Docker image, which is already large, and continues to be a challenge for us.

@droazen
Contributor

droazen commented Oct 15, 2018

If AVX is now required in the mainline tensorflow release as well, then I agree with @lucidtronix that we don't need to maintain a fallback environment -- provided that we're talking about AVX1 here, and not AVX2/AVX512. We don't currently have the ability to request nodes on travis or Google cloud with AVX2/AVX512.

I'd like to see this PR modified, however, to throw a UserException from the tool rather than in customCommandLineValidation(). The UserException should include a message with a pointer to versions of Tensorflow that work on non-AVX hardware.

@@ -202,6 +204,13 @@
return new String[]{"No default architecture for tensor type:" + tensorType.name()};
}
}

IntelGKLUtils utils = new IntelGKLUtils();
if (utils.isAvxSupported() == false)
Contributor:

Can you throw a UserException from the tool's onStartup() method instead of doing the check here in customCommandLineValidation(), or is there a reason why you're doing the check here?

The error message should also be more helpful (e.g., it should contain a pointer to the latest version of TensorFlow that runs on non-AVX hardware).

Contributor Author:

Also, AVX, not AVX2, is the minimum hardware requirement.

Contributor Author:

@droazen @lucidtronix Can't override onStartup() because the base class VariantWalker made the method final; the advice in the comment for that method suggests placing "startup" code in onTraversalStart(). Is customCommandLineValidation() perhaps the "least bad" place to put the check?

Another problem: advising users on how to obtain an older/compatible TensorFlow build looks like it could be problematic too: how is our AVX check to know whether the installed build of TensorFlow requires AVX? We could check the version, but a user might try a non-AVX-requiring user-created build (@mepowers linked to a v1.8 TensorFlow build that doesn't require it, for example).

Contributor Author:

(so the user installs a compatible TF, but the 'AVX required' error would not go away).

Contributor:

@EdwardDixon Sorry, I meant onTraversalStart(), not onStartup(). Precondition checks like this belong in onTraversalStart() -- customCommandLineValidation() is for injecting hooks into the arg parser.

As for the second issue, I'd suggest that we add a --disable-avx-check argument to the tool. The UserException message can then say something like this: "This tool requires AVX instruction set support by default due to its dependency on recent versions of the TensorFlow library. If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX, you may attempt to re-run the tool with the --disable-avx-check argument to bypass this check. Note that such configurations are not officially supported."
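A sketch of how those two suggestions could fit together in CNNScoreVariants (the field name disableAVXCheck is assumed for illustration; the message text is the one proposed above, and IntelGKLUtils/UserException.HardwareFeatureException are the types already used in this PR):

```java
@Advanced
@Argument(fullName = "disable-avx-check", shortName = "disable-avx-check",
        doc = "If set, no check will be made for AVX support.", optional = true)
private boolean disableAVXCheck = false;   // assumed field name

@Override
public void onTraversalStart() {
    // Precondition check, per review: fail fast with a readable UserException
    // rather than crashing later inside the TensorFlow native libraries.
    if (!disableAVXCheck) {
        final IntelGKLUtils utils = new IntelGKLUtils();
        if (!utils.isAvxSupported()) {
            throw new UserException.HardwareFeatureException(CNNScoreVariants.AVXREQUIRED_ERROR);
        }
    }
    // ... the tool's existing onTraversalStart() setup ...
}
```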

@droazen self-assigned this Oct 15, 2018
@EdwardDixon
Contributor Author

@mepowers Happy with this?

@cmnbroad
Collaborator

Sounds like a good step forward to me. Before we can merge this though, we still need to get it to build and pass tests on Travis, and see what the impact is on the size of the docker image, in case we have an issue with that.

@cmnbroad
Collaborator

cmnbroad commented Oct 15, 2018

@EdwardDixon You might try adding "-q" to the conda command in the Dockerfile to see if that allows the build to proceed. If that works, then we can see what the test results look like, and the resulting docker image size.

@rpomaris

@EdwardDixon - This sounds good to me. Do we think throwing a UserException from onStartup() rather than in customCommandLineValidation() will solve both the Travis issue and the Docker image size issue?

Re: Pointing to non-AVX Tensorflow, here's a wheel built for TF v1.8 on a SandyBridge machine, pulled from the thread @cmnbroad referenced above. Open to other suggestions.

@cmnbroad
Collaborator

cmnbroad commented Oct 15, 2018

Changing the exception-throwing code won't help with either of those; that code only executes when the tool is actually run. Currently, the PR is failing to even build on Travis because Travis is killing it during the part of the build where it creates the Docker image on which the tests will run. It appears to be producing so much progress output during the conda environment creation, right when it's resolving TensorFlow packages, that Travis terminates the whole build.

My suggestion above was to see if we can (at least temporarily) get past that so we can get the build to succeed, see how big the resulting Docker image is, and see whether the CNNScoreVariants tests pass with the new environment. Then we can figure out whether we have any additional issues to resolve.

@EdwardDixon
Contributor Author

EdwardDixon commented Oct 15, 2018 via email

@droazen mentioned this pull request Oct 19, 2018
@lucidtronix
Contributor

So most of the jobs are still failing with: "The job exceeded the maximum log length, and has been terminated." @cmnbroad @droazen What log is Travis referring to? In the Travis job log tab I only see < 3000 lines, and other tests routinely output far more than that and pass.

@lbergelson
Member

@droazen I'm not sure customCommandLineValidation works as expected. I've wanted for a long time to support something like it for argument collections though... Maybe we should consider beefing it up in barclay.

@rpomaris

rpomaris commented Nov 8, 2018

@cmnbroad @lucidtronix -- @EdwardDixon and I made a revision to condaenv.yml.template and re-committed. Now, it looks like the Travis build is still failing for a couple jobs, but the rest are successful, which seems like a solid improvement. Could you please check the fails and see what you think?

@EdwardDixon
Contributor Author

Much appreciated! I had pulled from the Broad master before your message came in and done as you suggest with our changes, which are now confined to CNNScoreVariants. I think I may have managed it... if not, I'll go back to step 1 and give it one more try.

@EdwardDixon
Contributor Author

@droazen Applied your steps, I hope correctly - the pull request looks clean now. Your steps were a huge help.

The Travis build looks like it's failing now, for reasons not obviously connected with our commit:

Error: (converted from warning) unable to access index for repository http://cran.mtu.edu/src/contrib Execution halted The command "if [[ $TEST_DOCKER != true ]]; then sudo mkdir -p /usr/local/lib/R/; sudo mkdir -p site-library; sudo ln -sFv ~/site-library /usr/local/lib/R/site-library; sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9; sudo add-apt-repository "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"; sudo apt-get update; sudo apt-get install -y --force-yes r-base-dev=3.1.3-1trusty; sudo apt-get install -y --force-yes r-base-core=3.1.3-1trusty; sudo Rscript scripts/docker/gatkbase/install_R_packages.R; fi;" failed and exited with 1 during .

@EdwardDixon
Contributor Author

Looks like CI is erroring out for other unrelated pull requests also? https://travis-ci.org/broadinstitute/gatk/builds/453937675

@EdwardDixon
Contributor Author

@mepowers any ideas? The code looks good and the variant-scoring tests are passing, but other CI steps are erroring out.

@EdwardDixon
Contributor Author

@droazen @cmnbroad @lucidtronix @mepowers All tests passing! Hope we are now good to merge?

@lucidtronix (Contributor) left a review comment:

The changes here all look fine to me, but as it stands this PR just adds the check; it doesn't actually make the upgrade. Are we going to do a separate PR collapsing the conda environments into one and upgrading the yaml to point at Intel's TensorFlow?

@EdwardDixon
Contributor Author

@lucidtronix You are actually using the Intel build already - ever since you switched to TF 1.9 from Conda, so you kind of need this check now!

@rpomaris

Good catch @EdwardDixon.

From @cmnbroad 's Oct 12 post:

We may need to provide a fallback environment for those (I'll try to get resolution on that). If it turns out we do, I'm actually not suggesting the fallback be automatic (3 in your list), just that we have a graceful failure mode and an instructive error message.

We have this now, right? Just want to be sure. If there's more work to be done, let's hash it out on another PR.

@EdwardDixon (Contributor, Author) left a review comment:

(quoting @rpomaris above) We have this now, right? Just want to be sure. If there's more work to be done, let's hash it out on another PR.

We have this now! I had actually just committed a README change and changes to the build script and template to remove the now-outdated reference to a separate environment for the Intel changes.


@EdwardDixon
Contributor Author

Thanks @lucidtronix ! @cmnbroad , @droazen , are we good to merge?

@droazen removed their assignment Nov 16, 2018
@cmnbroad (Collaborator) left a review comment:

Looks like this is close now. Minor cleanup changes requested and this should be good to go. Thanks for seeing it through @EdwardDixon!

README.md Outdated
./gatk PrintReadsSpark \
-I gs://my-gcs-bucket/path/to/input.bam \
-O gs://my-gcs-bucket/path/to/output.bam \
```n
Collaborator:

Looks like some text was inadvertently deleted here, and should be restored.

@@ -43,8 +43,8 @@ dependencies:
- scikit-learn==0.19.1
- scipy==1.0.0
- six==1.11.0
- $tensorFlowDependency
- tensorflow==1.9.0
Collaborator:

The tensorFlowDependency variable still exists in build.gradle, but due to this change it's no longer honored. I think if you just completely revert all the changes in this file, everything will be right again.

@@ -160,6 +166,11 @@
@Argument(fullName = "output-tensor-dir", shortName = "output-tensor-dir", doc = "Optional directory where tensors can be saved for debugging or visualization.", optional = true)
private String outputTensorsDir = "";

@Advanced
@Argument(fullName = "disable-avx-check", shortName = "disable-avx-check", doc = "If set, no check will be made for AVX support. " +
Collaborator:

fullName and shortName should both reference the DISABLE_AVX_CHECK_NAME constant mentioned above.

@@ -108,6 +111,9 @@
"1D models will look at the reference sequence and variant annotations." +
"2D models look at aligned reads, reference sequence, and variant annotations." +
"2D models require a BAM file as input as well as the tensor-type argument to be set.";
static final String AVXREQUIRED_ERROR = "This tool requires AVX instruction set support by default due to its dependency on recent versions of the TensorFlow library.\n" +
" If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX you may attempt to re-run the tool with the --disable-avx-check argument to bypass this check.\n" +
" Note that such configurations are not officially supported.";
Collaborator:

Please add a string constant DISABLE_AVX_CHECK_NAME for the value disable-avx-check, and replace the argument name that's embedded in this String with a %s format specifier. When the exception is thrown below, call String.format, passing it this string and the DISABLE_AVX_CHECK_NAME constant.

if (utils.isAvxSupported() == false) {
// Give user the bad news, suggest remedies.
throw new UserException.HardwareFeatureException(CNNScoreVariants.AVXREQUIRED_ERROR);
}
Collaborator:

throw new UserException.HardwareFeatureException(String.format(CNNScoreVariants.AVXREQUIRED_ERROR, DISABLE_AVX_CHECK_NAME));
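Taken together, the requested change would look roughly like this (a sketch: only the constant name, the %s placeholder, and the formatted throw are prescribed by the review; the field name disableAVXCheck and the surrounding layout are assumed):

```java
// Single source of truth for the argument name, referenced by the @Argument
// annotation and by the error message via String.format.
public static final String DISABLE_AVX_CHECK_NAME = "disable-avx-check";

static final String AVXREQUIRED_ERROR =
        "This tool requires AVX instruction set support by default due to its dependency on " +
        "recent versions of the TensorFlow library.\n" +
        " If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX, " +
        "you may attempt to re-run the tool with the --%s argument to bypass this check.\n" +
        " Note that such configurations are not officially supported.";

@Advanced
@Argument(fullName = DISABLE_AVX_CHECK_NAME, shortName = DISABLE_AVX_CHECK_NAME,
        doc = "If set, no check will be made for AVX support.", optional = true)
private boolean disableAVXCheck = false;   // assumed field name

// At the check site:
// throw new UserException.HardwareFeatureException(
//         String.format(CNNScoreVariants.AVXREQUIRED_ERROR, DISABLE_AVX_CHECK_NAME));
```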

@EdwardDixon
Contributor Author

@droazen @lucidtronix @cmnbroad Made some changes... I think our command is now a little tidier than its neighbors. The conda env template is back the way you like it. Hopefully merge-worthy?

@cmnbroad
Collaborator

@EdwardDixon Everything else looks good now, but there are still 3 lines deleted from README.md that I assume are from a bad merge or bad conflict resolution. Can you restore those 3 lines? Thanks.

@EdwardDixon
Contributor Author

EdwardDixon commented Nov 16, 2018 via email

@cmnbroad
Collaborator

cmnbroad commented Nov 16, 2018

@EdwardDixon Agreed. But I'm talking about lines 276-279, where there is what appears to be an unrelated/accidental deletion.

@cmnbroad dismissed droazen's stale review November 19, 2018 16:38

All issues have been addressed.

@cmnbroad merged commit b7c0560 into broadinstitute:master Nov 19, 2018
@cmnbroad
Collaborator

Thanks for fixing that @EdwardDixon.

@EdwardDixon
Contributor Author

@cmnbroad You just made my Thanksgiving!
