We recently observed Apache Spark 3.1 jobs hang on write using connector 2.2.4 on Dataproc 2.0. Here is a thread dump of a 'stuck' executor which will hang forever:
The FutureTask.get() is blocked on the BatchHelper's list of HttpRequests issued from GoogleCloudStorageImpl.getItemInfos(GoogleCloudStorageImpl.java:1891).
GoogleCloudStorageImpl uses RetryHttpInitializer, which does not set a write timeout, and the default HttpRequest write timeout is infinite.
I added a rather crude simulation of the hang in com.google.api.client.http.HttpRequest, and the BatchHelper stack appears the same:
"main@1" prio=5 tid=0x1 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFromFuture(GoogleCloudStorageFileSystem.java:892)
at com.google.cloud.hadoop.gcsio.BatchHelper.awaitRequestsCompletion(BatchHelper.java:266)
at com.google.cloud.hadoop.gcsio.BatchHelper.flushIfPossible(BatchHelper.java:206)
at com.google.cloud.hadoop.gcsio.BatchHelper.flush(BatchHelper.java:237)
at com.google.cloud.hadoop.gcsio.BatchHelperTest.lockTest(BatchHelperTest.java:272)
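For reference, here is a stand-alone sketch of the same failure mode (hypothetical code, not the actual test or connector classes): a task that never completes, and an unbounded get() that parks the calling thread exactly like the stack above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();

    // Stand-in for a batched HttpRequest whose write never returns.
    Future<Void> pendingRequest = executor.submit(() -> {
      Thread.sleep(Long.MAX_VALUE);
      return null;
    });

    // pendingRequest.get(); // unbounded get(): parks forever, as in the dump above

    // A bounded get() surfaces the stall instead of hanging the caller.
    try {
      pendingRequest.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      System.out.println("request still pending after 5s");
    } finally {
      executor.shutdownNow();
    }
  }
}
```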
Does it make sense to add a configurable write timeout to RetryHttpInitializer, or would the operations not be idempotent? I admit that the write timeout didn't help in my artificial test, so I'm asking the question early in case it can be ruled out as an option.
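For concreteness, a minimal sketch of the kind of change I have in mind (the wrapper class name and the timeout value are illustrative only, not the connector's actual code or a proposed default):

```java
import java.io.IOException;
import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestInitializer;

// Illustrative wrapper: preserves the delegate's behavior and bounds the write phase.
public class WriteTimeoutInitializer implements HttpRequestInitializer {
  private final HttpRequestInitializer delegate; // e.g. RetryHttpInitializer
  private final int writeTimeoutMillis;

  public WriteTimeoutInitializer(HttpRequestInitializer delegate, int writeTimeoutMillis) {
    this.delegate = delegate;
    this.writeTimeoutMillis = writeTimeoutMillis;
  }

  @Override
  public void initialize(HttpRequest request) throws IOException {
    delegate.initialize(request);
    request.setWriteTimeout(writeTimeoutMillis); // default is 0, i.e. infinite
  }
}
```

It would be wired in wherever the initializer is passed today, e.g. new WriteTimeoutInitializer(retryHttpInitializer, 30_000), with the 30-second value picked purely as an example.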
Write timeout works only for PUT and POST requests and is implemented using a per-write thread/executor, which could have negative performance implications; that's why we didn't make use of it.
I think the more systemic approach in #687 would be a preferable solution to this issue.
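To illustrate the concern, this is roughly the shape of that pattern (a sketch only, not the actual google-http-client internals): the request body is written on a separate thread so the caller can bound the wait, at the cost of an extra thread per in-flight POST/PUT.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWriteSketch {
  private static final ExecutorService WRITE_EXECUTOR = Executors.newCachedThreadPool();

  // Writes the body on a separate thread and waits at most timeoutMillis for it.
  static void writeWithTimeout(byte[] body, OutputStream out, long timeoutMillis)
      throws IOException {
    Future<?> write = WRITE_EXECUTOR.submit(() -> {
      out.write(body);
      out.flush();
      return null;
    });
    try {
      write.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      write.cancel(true); // abandon the stalled write instead of hanging the caller
      throw new IOException("write timed out after " + timeoutMillis + " ms", e);
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
```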