AVX present? #5291
Conversation
@lucidtronix Hi Sam, wonder if you could take a look at this? Adding the "AVX required" message.
@EdwardDixon Isn't this a dupe of #5142? Have you addressed our original concerns from that PR's discussion thread, some of which I've reproduced below?
@EdwardDixon Thanks for this! I think the AVX check in CNNScoreVariants is good. As @droazen points out, we still want the split environments, though now with the check in place I think we can make Intel-optimized the default. Other thoughts @droazen, @cmnbroad?
@lucidtronix Agreed -- provided that there's a fallback conda environment that users without AVX can use, and provided that the tool produces an easy-to-understand error message when AVX is not present, pointing the user to that fallback environment.
We need to retain the two alternative conda environments for the reasons mentioned in the last PR. It would be great to make Intel-TF the default, but to do so we need to find a way to handle the non-AVX case gracefully. See my inline comments. Also, the Travis build is still failing as in the last PR (the log exceeds the maximum allowed length).
IntelGKLUtils utils = new IntelGKLUtils();
if (utils.isAvxSupported() == false)
{
    return new String[]{CNNScoreVariants.AVXREQUIRED_ERROR};
Per the conversation last time, we can't just completely fail to run when AVX isn't present; we need to allow the user to run using the alternative conda environment. Given that, if we were to make Intel-TF the default, what does the failure mode look like when AVX isn't present? IIRC, it was a pretty hard failure? If a deterministic exception were raised in Python, it might be possible to detect that and handle it gracefully, and change this message to tell the user what to do.
Maybe the answer is for the conda environments to set an extra environment variable that would allow GATK to detect which conda environment it's in. Then you could have a check in CNNScoreVariants that aborts the tool only if AVX is not present AND you're running in the Intel conda environment, and point the user to the non-Intel conda environment in the error message.
BTW, if anyone can provide the output from an interactive Python session that tries (and fails) to use Intel-TF on non-AVX hardware, it might help us figure out how to detect that case.
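A rough sketch of the environment-variable idea being floated here, not something in this PR: the GATK_CONDA_ENV variable name and the helper method are hypothetical (the conda environments do not currently export such a variable), while IntelGKLUtils and isAvxSupported() are the same GKL pieces used in the diff above.

```java
// Hypothetical sketch only: assumes each conda activation script exports a
// variable such as GATK_CONDA_ENV (e.g. "gatk-intel" vs. "gatk").
private boolean shouldAbortForMissingAvx() {
    final String condaEnv = System.getenv("GATK_CONDA_ENV");
    final boolean inIntelEnv = condaEnv != null && condaEnv.contains("intel");
    final boolean avxPresent = new IntelGKLUtils().isAvxSupported();
    // Abort only when the Intel environment is active AND the CPU lacks AVX;
    // the error message would then point the user at the non-Intel environment.
    return inIntelEnv && !avxPresent;
}
```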
Not a bad idea, will look into that tomorrow. Note that you are using TensorFlow 1.4 or 1.5, and that from v1.6 even the non-Intel-optimized build supports only AVX-capable machines.
@EdwardDixon I did not know that! In that case master does already require AVX. If it only impacts this tool and we provide sufficient warning and instructions, I think the single Intel-optimized conda environment will be much easier to test and maintain. Users who don't have AVX can simply install an older TensorFlow in their environment, but GATK doesn't need to worry about it.
Hm. Just yesterday we updated from TF 1.4 to 1.9. Although this makes it more compelling to switch the default to Intel-optimized, we may still have an issue for the reasons outlined in the previous PR (academic users, not all GCS zones guarantee AVX hardware, and it's still unclear to me whether Travis, which uses both GCS and EC2, makes such a guarantee). It sounds like the failure mode is to crash. For running inference at least (training may be a different story), we may need something better. Another option: it sounds like it's possible to build our own distribution without AVX dependencies to use as a fallback.
Hi @cmnbroad, we can do better than crash - you'll notice that in my pull request I've added a check for exactly this. You could definitely offer your own build of TensorFlow, but do you want to? Apart from anything else, it'll be 10X slower than the default. My understanding is that inference using CNNScoreVariants was taking ~50 hours on a typical real-world file before optimization. Wouldn't it be more cost-effective just to use instances that at least have AVX?
Marissa Powers here -- I'm an Intel engineer on the same team as Ed. It sounds like we all agree on having Intel-optimized TF as the default and figuring out the best intervention for older machines from there. We can add the AVX flag within CNNScoreVariants (and any other AI tool). From there, we can (1) provide a detailed error output describing the issue, (2) provide a non-AVX TF build, and (3) automatically roll back TF to the provided version. @EdwardDixon, it sounds like @cmnbroad is suggesting options (1), (2), and (3), while you're suggesting just (1). Sound right?
Hi Marissa - I think we're all in agreement that we'd like to find a way to make Intel-TF the default, but whether or not we can have CNNScoreVariants require AVX to run is less clear. Naturally, we'd prefer not to have to provide a custom TF distribution as a fallback, but there are 3 cases where we may not have a choice: users with old hardware, Travis/CI testing, and GCE. We may need to provide a fallback environment for those (I'll try to get resolution on that). If it turns out we do, I'm actually not suggesting the fallback be automatic (3 in your list), just that we have a graceful failure mode and an instructive error message. In the meantime, there is still the issue that this PR fails to even build on Travis. It looks like it produces so much output building the Docker image that it exceeds the allowable Travis build log size. That will need to be resolved, and we'll also need to understand the impact of this change on the size of our Docker image, which is already large and continues to be a challenge for us.
If AVX is now required in the mainline TensorFlow release as well, then I agree with @lucidtronix that we don't need to maintain a fallback environment -- provided that we're talking about AVX1 here, and not AVX2/AVX512. We don't currently have the ability to request nodes on Travis or Google Cloud with AVX2/AVX512. I'd like to see this PR modified, however, to throw a UserException when AVX is not present (see the inline comments below).
@@ -202,6 +204,13 @@
            return new String[]{"No default architecture for tensor type:" + tensorType.name()};
        }
    }

    IntelGKLUtils utils = new IntelGKLUtils();
    if (utils.isAvxSupported() == false)
Can you throw a UserException from the tool's onStartup() method instead of doing the check here in customCommandLineValidation(), or is there a reason why you're doing the check here?
The error message should also be more helpful (e.g., it should contain a pointer to the latest version of TensorFlow that runs on non-AVX hardware).
Also, AVX, not AVX2, is the minimum hardware requirement.
@droazen @lucidtronix Can't override onStartup() because the base class VariantWalker made the method final - the advice in the comment for that method suggests placing "startup" code in onTraversalStart. Is customCommandLineValidation perhaps the "least bad" place to put the check?
Another problem: adding the advice to users on how to access an older/compatible TensorFlow build looks like it could be problematic too: how is our AVX check to know whether the installed build of TensorFlow requires AVX? We could check the version - but a user might try a non-AVX-requiring user-created build (@mepowers linked to a v1.8 TensorFlow that doesn't require it, for example).
(so the user installs a compatible TF, but the 'AVX required' error would not go away).
@EdwardDixon Sorry, I meant onTraversalStart(), not onStartup(). Precondition checks like this belong in onTraversalStart() -- customCommandLineValidation() is for injecting hooks into the arg parser.
As for the second issue, I'd suggest that we add a --disable-avx-check argument to the tool. The UserException message can then say something like this: "This tool requires AVX instruction set support by default due to its dependency on recent versions of the TensorFlow library. If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX, you may attempt to re-run the tool with the --disable-avx-check argument to bypass this check. Note that such configurations are not officially supported."
@mepowers Happy with this?
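To make the suggestion concrete, here is a minimal sketch of what that could look like inside CNNScoreVariants. Class scaffolding and imports are omitted, and the disableAVXCheck field plus the DISABLE_AVX_CHECK_NAME and AVXREQUIRED_ERROR constants follow the names discussed in this thread rather than anything already merged.

```java
// Sketch only: assumes the --disable-avx-check argument is bound to a
// disableAVXCheck boolean field, and that AVXREQUIRED_ERROR contains a %s
// placeholder for the argument name.
@Override
public void onTraversalStart() {
    if (!disableAVXCheck) {
        final IntelGKLUtils utils = new IntelGKLUtils();
        if (!utils.isAvxSupported()) {
            // Fail fast with a user-facing error that explains the bypass flag.
            throw new UserException.HardwareFeatureException(
                    String.format(AVXREQUIRED_ERROR, DISABLE_AVX_CHECK_NAME));
        }
    }
    // ... the rest of the tool's startup logic would follow here.
}
```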
Sounds like a good step forward to me. Before we can merge this though, we still need to get it to build and pass tests on Travis, and see what the impact is on the size of the docker image, in case we have an issue with that.
@EdwardDixon You might try adding "-q" to the conda command that creates the environment during the Docker build, to cut down the progress output.
@EdwardDixon - This sounds good to me. Do we think throwing a UserException will also address the Travis build and Docker image size issues?
Re: Pointing to non-AVX TensorFlow, here's a wheel built for TF v1.8 on a SandyBridge machine, pulled from the thread @cmnbroad referenced above. Open to other suggestions.
Changing the exception throwing code won't help with either of those - that code would only execute when the tool is actually run. Currently, the PR is failing to even build on Travis because Travis is killing it during the part of the build where it creates the Docker image on which the tests will run. It appears to be producing so much progress output during the conda environment creation - right when it's resolving tensorflow packages - that Travis terminates the whole build. My suggestion above was to see if we can (at least temporarily) get past that so we can get the build to succeed, see how big the resulting Docker image is, and see whether the CNNScoreVariants tests pass with the new environment. Then we can figure out if we have any additional issues to resolve.
Anecdotally, some people have seen logging issues around the Conda progress bars filling logs. I know Conda had a change request to add a silent or quiet mode. Suggestions on the back of a postcard...
@droazen I'm not sure
@cmnbroad @lucidtronix -- @EdwardDixon and I made a revision to condaenv.yml.template and re-committed. Now it looks like the Travis build is still failing for a couple of jobs, but the rest are successful, which seems like a solid improvement. Could you please check the failures and see what you think?
Much appreciated! I had pulled from the Broad master before your message came in & done as you suggested with our changes, which are now confined to CNNScoreVariants. I think I may have managed it... if not, I'll go back to step 1 and give it one more try.
Force-pushed 2ec3f4e to 2af3c00 (compare)
@droazen Applied your steps, I hope correctly - the pull request looks clean now. Your steps were a huge help. The Travis build looks like it's failing now, for reasons not obviously connected with our commit:
Looks like CI is erroring out for other unrelated pull requests also? https://travis-ci.org/broadinstitute/gatk/builds/453937675
@mepowers any ideas? Code looks good and the variant score tests are passing, but other CI steps are erroring out.
@droazen @cmnbroad @lucidtronix @mepowers All tests passing! Hope we are now good to merge?
The changes here all look fine to me, but as it stands this PR just adds the check; it doesn't actually make the upgrade. Are we going to do a separate PR collapsing the conda environments into one and upgrading the yaml to point at Intel's TensorFlow?
@lucidtronix You are actually using the Intel build already - ever since you switched to TF 1.9 from Conda, so you kind of need this check now!
Good catch @EdwardDixon. From @cmnbroad's Oct 12 post:
"We may need to provide a fallback environment for those (I'll try to get resolution on that). If it turns out we do, I'm actually not suggesting the fallback be automatic (3 in your list), just that we have a graceful failure mode and an instructive error message."
We have this now, right? Just want to be sure. If there's more work to be done, let's hash it out on another PR.
We have this now! I had actually just committed a readme change & a change to the build script & template to remove the now-outdated reference to a separate environment for Intel changes.
Thanks @lucidtronix! @cmnbroad, @droazen, are we good to merge?
Looks like this is close now. Minor cleanup changes requested and this should be good to go. Thanks for seeing it through @EdwardDixon!
README.md
Outdated
./gatk PrintReadsSpark \
    -I gs://my-gcs-bucket/path/to/input.bam \
    -O gs://my-gcs-bucket/path/to/output.bam \
Looks like some text was inadvertently deleted here, and should be restored.
scripts/gatkcondaenv.yml.template
Outdated
@@ -43,8 +43,8 @@ dependencies:
    - scikit-learn==0.19.1
    - scipy==1.0.0
    - six==1.11.0
    - $tensorFlowDependency
    - tensorflow==1.9.0
The tensorFlowDependency variable still exists in build.gradle, but due to this change it's no longer honored. I think if you just completely revert all the changes in this file, everything will be right again.
@@ -160,6 +166,11 @@
    @Argument(fullName = "output-tensor-dir", shortName = "output-tensor-dir", doc = "Optional directory where tensors can be saved for debugging or visualization.", optional = true)
    private String outputTensorsDir = "";

    @Advanced
    @Argument(fullName = "disable-avx-check", shortName = "disable-avx-check", doc = "If set, no check will be made for AVX support. " +
fullName and shortName should both reference the DISABLE_AVX_CHECK_NAME constant mentioned above.
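For example, a sketch of what that could look like; the second sentence of the doc string and the field default are illustrative assumptions rather than text from the actual diff.

```java
// Both names reference the shared constant, so the flag cannot drift out of
// sync with the error message that mentions it.
@Advanced
@Argument(fullName = DISABLE_AVX_CHECK_NAME, shortName = DISABLE_AVX_CHECK_NAME,
        doc = "If set, no check will be made for AVX support. " +
              "Use only with a TensorFlow build that does not require AVX.",
        optional = true)
private boolean disableAVXCheck = false;
```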
@@ -108,6 +111,9 @@
            "1D models will look at the reference sequence and variant annotations." +
            "2D models look at aligned reads, reference sequence, and variant annotations." +
            "2D models require a BAM file as input as well as the tensor-type argument to be set.";
    static final String AVXREQUIRED_ERROR = "This tool requires AVX instruction set support by default due to its dependency on recent versions of the TensorFlow library.\n" +
            " If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX you may attempt to re-run the tool with the --disable-avx-check argument to bypass this check.\n" +
            " Note that such configurations are not officially supported.";
Please add a string constant DISABLE_AVX_CHECK_NAME for the value disable-avx-check, and replace the argument name that's embedded in this String with a %s format specifier. When the exception is thrown below, call String.format, passing it this string and the DISABLE_AVX_CHECK_NAME string.
if (utils.isAvxSupported() == false) {
    // Give user the bad news, suggest remedies.
    throw new UserException.HardwareFeatureException(CNNScoreVariants.AVXREQUIRED_ERROR);
}
throw new UserException.HardwareFeatureException(String.format(CNNScoreVariants.AVXREQUIRED_ERROR, DISABLE_AVX_CHECK_NAME));
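Putting the pieces of this request together, a sketch of the refactored code might look like the following; field placement and line wrapping are illustrative, while the constant name and message wording come from the comments above.

```java
// Argument name shared between the @Argument annotation and the error message.
public static final String DISABLE_AVX_CHECK_NAME = "disable-avx-check";

// %s is filled in with DISABLE_AVX_CHECK_NAME when the exception is thrown.
static final String AVXREQUIRED_ERROR =
        "This tool requires AVX instruction set support by default due to its dependency on recent versions of the TensorFlow library.\n" +
        " If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX, you may attempt to re-run the tool with the --%s argument to bypass this check.\n" +
        " Note that such configurations are not officially supported.";

// ... later, at the check site, when AVX is found to be missing:
if (!utils.isAvxSupported()) {
    throw new UserException.HardwareFeatureException(
            String.format(CNNScoreVariants.AVXREQUIRED_ERROR, DISABLE_AVX_CHECK_NAME));
}
```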
@droazen @lucidtronix @cmnbroad Made some changes... I think our command is a little bit tidier than its neighbors. The conda env template is now back the way you like it. Hopefully merge-worthy?
@EdwardDixon Everything else looks good now, but there are still 3 lines deleted from README.md that I assume are from a bad merge or bad conflict resolution or something. Can you restore those 3 lines? Thanks.
We thought the readme should reflect that there is now just 1 Conda environment, not 2.
@EdwardDixon Agreed. But I'm talking about lines 276-279, where there is what appears to be an unrelated/accidental deletion.
Thanks for fixing that @EdwardDixon.
@cmnbroad You just made my Thanksgiving!
CNNScoreVariants relies on a computationally demanding operation - a deep neural network. Using an Intel-optimized version of TensorFlow gives a 10X improvement in performance (e.g. 50 hours to 5 hours for a typical input). However, these improvements mean that we now have a minimum hardware requirement - the availability of AVX.