GATK failing in FireCloud with google ExecutionException (caused by UnknownHostException) #5094
Is this also related to the emails that we are receiving from Jenkins (e.g., https://gatk-jenkins.broadinstitute.org/job/GATK-BQSR/1359/)?
@magicDGS I don't think so. Those tests have been failing forever. I think someone finally fixed the bug that was causing the email notifications to not happen. They're useless and annoying though, so I disabled that test... We haven't fixed it in years, so we're unlikely to do so now...
Ok, thanks for the feedback @lbergelson!
I've also seen UnknownHostException: metadata, which seems like it's probably related. My favorite part about that exception is
See #4888, which is an older report for the same issue. As mentioned there, I think we should patch our fork of google-cloud-java.
This brings in some additional retries for UnknownHostException and 502 errors, and moves us from a fork in my personal GitHub repository to the fork at https://github.com/broadinstitute/google-cloud-java. Resolves #4888. Resolves #5094.
…nownHostException We've frequently encountered both of these errors transiently in the wild (see broadinstitute/gatk#4888 and broadinstitute/gatk#5094). I also increased the maximum depth when inspecting nested exceptions for reopenable errors from 10 to 20, as we've seen chains of exceptions that come very close to the current limit of 10.
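For illustration, here is a minimal Java sketch of the kind of cause-chain inspection described above: walk nested exceptions up to a fixed depth and treat an UnknownHostException, or an IOException that looks like an HTTP 502, as retryable. The class name, constant, and message matching are hypothetical; this is not the actual google-cloud-java patch.

```java
import java.io.IOException;
import java.net.UnknownHostException;

// Hypothetical sketch, not the actual google-cloud-java code: walk the cause
// chain of a failure looking for errors that are safe to retry, up to a fixed depth.
public final class RetryableCauseInspector {

    // The patch described above raises the inspection depth from 10 to 20;
    // the constant name here is illustrative only.
    private static final int MAX_CAUSE_DEPTH = 20;

    /** Returns true if any cause within MAX_CAUSE_DEPTH looks transient. */
    public static boolean isRetryable(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof UnknownHostException) {
                return true; // transient DNS failure, e.g. "UnknownHostException: metadata"
            }
            // A real implementation would check the HTTP status code on the storage
            // client's exception type; matching on the message is a simplification here.
            if (current instanceof IOException
                    && current.getMessage() != null
                    && current.getMessage().contains("502")) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```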
Reopening -- @ldgauthier reports that this error still occurs even after the patch in #5099. With that patch, we are now retrying on UnknownHostException.
@jean-philippe-martin Can you comment on this error with your thoughts? Despite now doing a channel reopen on UnknownHostException, the error still occurs.
More info from @ldgauthier:
So it seems like the error is nondeterministic, but can't be recovered from within the same VM instance / process.
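As a rough sketch of what a channel reopen amounts to (the real retry logic lives inside the google-cloud-java NIO layer; the interface, class, and constant below are invented for illustration): reopen the channel, seek back to the last good position, and try the read again. It also illustrates why an in-process retry cannot help here: if name resolution is broken for the whole VM, every reopen fails the same way.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

// Hypothetical sketch of reopen-and-resume retry logic; names and values are illustrative.
public final class ReopeningReader {

    /** Supplier of a fresh channel; may throw IOException on open. */
    @FunctionalInterface
    public interface ChannelOpener {
        SeekableByteChannel open() throws IOException;
    }

    private static final int MAX_REOPEN_ATTEMPTS = 3; // illustrative value

    /**
     * Reads into {@code buffer} starting at {@code position}, reopening the channel
     * and seeking back if an I/O error (such as an UnknownHostException) occurs.
     */
    public static int readWithReopen(ChannelOpener opener, long position, ByteBuffer buffer)
            throws IOException {
        IOException lastError = null;
        for (int attempt = 0; attempt < MAX_REOPEN_ATTEMPTS; attempt++) {
            try (SeekableByteChannel channel = opener.open()) {
                channel.position(position);
                return channel.read(buffer);
            } catch (IOException e) {
                lastError = e;  // e.g. an UnknownHostException from the storage client
                buffer.clear(); // reset the buffer before the next attempt (sketch assumes a fresh buffer)
            }
        }
        // If the VM itself can no longer resolve hostnames, every attempt fails
        // identically, so retrying within the same process cannot recover.
        throw lastError;
    }
}
```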
@ldgauthier's error sounds like what I saw before when trying to run the joint genotyping pipeline. When I spoke about it with @ruchim, she said that based on some experiments she did and conversations with the production team, she thought it was a symptom of VMs running under PAPI; Slack excerpt:
So we decided that there's an effective limit of about 17 hours for VMs managed by PAPI, at which point either the network configuration or the metadata server process on the VMs changes, causing these failures. I worked around the issue by increasing my scatter interval count so that no tasks took longer than the ~17 hours that seems to be the critical point.
@ruchim Agreed -- feel free to grab some free time on our calendars. |
This was resolved by a patch to PAPIv1 just released by Google (see https://partnerissuetracker.corp.google.com/issues/112704449). Closing! |
We're seeing a high rate of failures running BQSR pipelines in production with ExecutionExceptions. The root cause seems to be an UnknownHostException thrown in the Google Storage API client.
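For context on how this failure presents to calling code, here is a small hypothetical Java example (readFromGcs is a stand-in, not a GATK or client-library method): a background read fails, the caller sees an ExecutionException, and walking the cause chain surfaces the underlying UnknownHostException.

```java
import java.net.UnknownHostException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo of the reported failure shape: an ExecutionException from a
// background read whose root cause is an UnknownHostException.
public final class RootCauseDemo {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<byte[]> pending = pool.submit(RootCauseDemo::readFromGcs);
        try {
            pending.get();
        } catch (ExecutionException e) {
            // Walk to the root cause so the real failure is visible in logs
            // instead of only the generic wrapper.
            Throwable root = e;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            System.err.println("Root cause: " + root); // e.g. UnknownHostException: metadata
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for a storage read that fails the way the issue describes.
    private static byte[] readFromGcs() throws Exception {
        throw new UnknownHostException("metadata");
    }
}
```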