[Bug] SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool #924

Closed
1 of 2 tasks
www2388258980 opened this issue Apr 17, 2023 · 7 comments · Fixed by #1361
Labels
bug Something isn't working

www2388258980 commented Apr 17, 2023

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

Paimon 0.4

Compute Engine

Flink 1.16

Minimal reproduce step

Description
/bin/yarn-session.sh --detached \
-Dtaskmanager.memory.process.size=5000m \
-Dtaskmanager.memory.managed.size=0m \
-Dtaskmanager.memory.network.min=80m \
-Dtaskmanager.memory.network.max=80m \
-Dtaskmanager.numberOfTaskSlots=4

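Flink on YARN.

The file system is S3. We use Paimon to build a layered real-time data warehouse; for example, a job reads several ODS Paimon tables and writes the result into one large wide table with 'merge-engine' = 'partial-update'.
The error appears after the job has been running for a while, from half an hour to more than an hour.
Other jobs (Flink CDC) that write to s3://xxxxxxx/hadoop/warehouse/ods_medatc_fts.db/src_public_comments/schema/schema-0 run normally.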

java.io.UncheckedIOException: java.io.InterruptedIOException: getFileStatus on s3://xxxxxxx/hadoop/warehouse/ods_medatc_fts.db/src_public_comments/schema/schema-0: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.paimon.schema.SchemaManager.schema(SchemaManager.java:460)
at org.apache.paimon.operation.KeyValueFileStoreRead.<init>(KeyValueFileStoreRead.java:88)
at org.apache.paimon.KeyValueFileStore.newRead(KeyValueFileStore.java:84)
at org.apache.paimon.table.ChangelogWithKeyFileStoreTable.newRead(ChangelogWithKeyFileStoreTable.java:193)
at org.apache.paimon.table.source.ReadBuilderImpl.newRead(ReadBuilderImpl.java:81)
at org.apache.paimon.flink.source.FlinkSource.createReader(FlinkSource.java:50)
at org.apache.flink.streaming.api.operators.SourceOperator.initReader(SourceOperator.java:286)
at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask.init(SourceOperatorStreamTask.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:692)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:669)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:904)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InterruptedIOException: getFileStatus on s3://xxxxxxx/hadoop/warehouse/ods_medatc_fts.db/src_public_comments/schema/schema-0: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:395)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:201)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
at org.apache.hadoop.fs.s3a.S3AFileSystem.extractOrFetchSimpleFileStatus(S3AFileSystem.java:5401)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1465)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1441)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.paimon.s3.HadoopCompliantFileIO.newInputStream(HadoopCompliantFileIO.java:47)
at org.apache.paimon.fs.PluginFileIO.lambda$newInputStream$0(PluginFileIO.java:47)
at org.apache.paimon.fs.PluginFileIO.wrap(PluginFileIO.java:104)
at org.apache.paimon.fs.PluginFileIO.newInputStream(PluginFileIO.java:47)
at org.apache.paimon.fs.FileIO.readFileUtf8(FileIO.java:173)
at org.apache.paimon.schema.SchemaManager.schema(SchemaManager.java:458)
... 14 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1219)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1165)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2545)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2533)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2513)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3776)
... 25 more
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:316)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at com.amazonaws.http.conn.$Proxy46.get(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)

What doesn't meet your expectations?

Please fix this error.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

www2388258980 added the bug label on Apr 17, 2023

JingsongLi commented Apr 17, 2023

See aws/aws-sdk-java#1405. Too much parallelism may cause this problem.

www2388258980 commented Apr 17, 2023

Quoting JingsongLi: "aws/aws-sdk-java#1405 Too much parallelism may cause this problem"

The job parallelism is 1, and three jobs are running in a single TaskManager.

JingsongLi commented

We can try setting fs.s3a.connection.maximum=1000.

www2388258980 commented

Too many small files can exhaust the S3 connection pool. Raising fs.s3a.connection.maximum increases the number of connections in the pool.
References:
[1] https://paimon.apache.org/docs/master/maintenance/expiring-snapshots/
[2] https://www.infoq.cn/article/dytkx8luglcu9a81f58q
[3] https://docs.aws.amazon.com/zh_cn/sdk-for-java/latest/developer-guide/best-practices.html
[4] https://zhuanlan.zhihu.com/p/559718865
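
As a rough sketch of how the larger pool could be applied to the session from the report: the command below repeats the reported yarn-session.sh options and adds the pool size as a Flink dynamic property. It assumes the S3 file system in use reads fs.s3a.* keys from the Flink configuration, as Flink's own Hadoop-based S3 file system is documented to do. Whether Paimon's bundled S3 plugin picks the key up this way is not confirmed in this thread, and it may instead need to be passed as a catalog option. The value 1000 is the one suggested in this thread, not a tuned recommendation.

```shell
# Same session options as in the report, plus a larger S3A connection pool.
# Assumption: fs.s3a.* keys set as Flink dynamic properties reach the S3A client.
/bin/yarn-session.sh --detached \
-Dtaskmanager.memory.process.size=5000m \
-Dtaskmanager.memory.managed.size=0m \
-Dtaskmanager.memory.network.min=80m \
-Dtaskmanager.memory.network.max=80m \
-Dtaskmanager.numberOfTaskSlots=4 \
-Dfs.s3a.connection.maximum=1000
```

Raising the limit only buys headroom. If the number of small files keeps growing, the pool can still run out, which is why the snapshot expiration page in [1] is also relevant.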

JingsongLi commented

#1037 also fixed this.
