GATK failing in FireCloud with google ExecutionException (caused by UnknownHostException) #5094
Is this also related to the emails that we are receiving from Jenkins (e.g., https://gatk-jenkins.broadinstitute.org/job/GATK-BQSR/1359/)?
@magicDGS I don't think so. Those tests have been failing forever. I think someone finally fixed the bug that was causing the email notifications to not happen. They're useless and annoying though, so I disabled that test... We haven't fixed it in years, so we're unlikely to do so now...
Ok, thanks for the feedback @lbergelson!
I've also seen UnknownHostException: metadata, which seems like it's probably related. My favorite part about that exception is
See #4888, which is an older report for the same issue. As mentioned there, I think we should patch our fork of google-cloud-java.
This brings in some additional retries for UnknownHostException and 502 errors, and moves us from a fork in my personal GitHub repository to the fork at https://github.com/broadinstitute/google-cloud-java. Resolves #4888. Resolves #5094.
…nownHostException We've frequently encountered both of these errors transiently in the wild (see broadinstitute/gatk#4888 and broadinstitute/gatk#5094). I also increased the maximum depth when inspecting nested exceptions for reopenable errors from 10 to 20, as we've seen chains of exceptions that come very close to the current limit of 10.
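For illustration, here is a minimal Java sketch of the kind of cause-chain inspection described above: walk nested exceptions up to a fixed depth and treat an UnknownHostException, or an IOException that looks like an HTTP 502, as retryable. The class name, constant, and message matching are hypothetical; this is not the actual google-cloud-java patch.

```java
import java.io.IOException;
import java.net.UnknownHostException;

// Hypothetical sketch, not the actual google-cloud-java code: walk the cause
// chain of a failure looking for errors that are safe to retry, up to a fixed depth.
public final class RetryableCauseInspector {

    // The patch described above raises the inspection depth from 10 to 20;
    // the constant name here is illustrative only.
    private static final int MAX_CAUSE_DEPTH = 20;

    /** Returns true if any cause within MAX_CAUSE_DEPTH looks transient. */
    public static boolean isRetryable(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof UnknownHostException) {
                return true; // transient DNS failure, e.g. "UnknownHostException: metadata"
            }
            // A real implementation would check the HTTP status code on the storage
            // client's exception type; matching on the message is a simplification here.
            if (current instanceof IOException
                    && current.getMessage() != null
                    && current.getMessage().contains("502")) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```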
Reopening -- @ldgauthier reports that this error still occurs even after the patch in #5099. With that patch, we are now retrying on UnknownHostException.
@jean-philippe-martin Can you comment on this error with your thoughts? Despite now doing a channel reopen on UnknownHostException, the error still occurs.
More info from @ldgauthier:
So it seems like the error is nondeterministic, but can't be recovered from within the same VM instance / process.
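As a rough sketch of what a channel reopen amounts to (the real retry logic lives inside the google-cloud-java NIO layer; the interface, class, and constant below are invented for illustration): reopen the channel, seek back to the last good position, and try the read again. It also illustrates why an in-process retry cannot help here: if name resolution is broken for the whole VM, every reopen fails the same way.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

// Hypothetical sketch of reopen-and-resume retry logic; names and values are illustrative.
public final class ReopeningReader {

    /** Supplier of a fresh channel; may throw IOException on open. */
    @FunctionalInterface
    public interface ChannelOpener {
        SeekableByteChannel open() throws IOException;
    }

    private static final int MAX_REOPEN_ATTEMPTS = 3; // illustrative value

    /**
     * Reads into {@code buffer} starting at {@code position}, reopening the channel
     * and seeking back if an I/O error (such as an UnknownHostException) occurs.
     */
    public static int readWithReopen(ChannelOpener opener, long position, ByteBuffer buffer)
            throws IOException {
        IOException lastError = null;
        for (int attempt = 0; attempt < MAX_REOPEN_ATTEMPTS; attempt++) {
            try (SeekableByteChannel channel = opener.open()) {
                channel.position(position);
                return channel.read(buffer);
            } catch (IOException e) {
                lastError = e;  // e.g. an UnknownHostException from the storage client
                buffer.clear(); // reset the buffer before the next attempt (sketch assumes a fresh buffer)
            }
        }
        // If the VM itself can no longer resolve hostnames, every attempt fails
        // identically, so retrying within the same process cannot recover.
        throw lastError;
    }
}
```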
@ldgauthier's error sounds like what I saw before when trying to run the joint genotyping pipeline. When I spoke about it with @ruchim, she said that based on some experiments she did and conversations with the production team, she thought it was a symptom of VMs running under PAPI; Slack excerpt:
So we decided that there's an effective limit of about 17 hours for VMs managed by PAPI, at which point either the network configuration or the metadata server process on the VMs changes, causing these failures. I worked around the issue by increasing my scatter interval count so that no tasks took longer than the ~17 hours that seems to be the critical point.
@ruchim Agreed -- feel free to grab some free time on our calendars. |
This was resolved by a patch to PAPIv1 just released by Google (see https://partnerissuetracker.corp.google.com/issues/112704449). Closing! |
We're seeing a high rate of failures running BQSR pipelines in production with ExecutionExceptions. The root cause seems to be an UnknownHostException thrown in the Google Storage API client.
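For context on how this failure presents to calling code, here is a small hypothetical Java example (readFromGcs is a stand-in, not a GATK or client-library method): a background read fails, the caller sees an ExecutionException, and walking the cause chain surfaces the underlying UnknownHostException.

```java
import java.net.UnknownHostException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo of the reported failure shape: an ExecutionException from a
// background read whose root cause is an UnknownHostException.
public final class RootCauseDemo {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<byte[]> pending = pool.submit(RootCauseDemo::readFromGcs);
        try {
            pending.get();
        } catch (ExecutionException e) {
            // Walk to the root cause so the real failure is visible in logs
            // instead of only the generic wrapper.
            Throwable root = e;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            System.err.println("Root cause: " + root); // e.g. UnknownHostException: metadata
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for a storage read that fails the way the issue describes.
    private static byte[] readFromGcs() throws Exception {
        throw new UnknownHostException("metadata");
    }
}
```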