Performance issues on S3 with publishDir and a large number of output files to be copied #856
Comments
Here it would be interesting to know the granularity of the produced files as well, because it could affect the performance of the multipart uploads.
In this specific case the file sizes were not big: on average the files were a few MB each. So I don't think it's a multipart upload issue; it's probably more an issue related to the number of API calls per second.
Provided the target of the … Things to verify: …
Maybe a solution to quickly move files across buckets?
@fstrozzi Do you have an example nf script available?
@tbugfinder I created this process to simulate the issue; a sketch of it is shown below. You can probably reduce the number of generated files.
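(The original snippet was not preserved in this transcript. A minimal sketch of such a test process might look like the following; the bucket path and file count are illustrative assumptions:)

```groovy
// Hypothetical reconstruction: one task writes many small files,
// all of which are published (copied) to an S3 location.
process makeManyFiles {
    publishDir 's3://my-bucket/publish-test', mode: 'copy'  // assumed target

    output:
    path 'out/*'

    script:
    """
    mkdir out
    for i in \$(seq 1 10000); do
        dd if=/dev/urandom of=out/file_\${i}.bin bs=1M count=1 2>/dev/null
    done
    """
}

workflow {
    makeManyFiles()
}
```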
Update: …
I'm looking at this. In principle, the uploads should be managed in parallel using a thread pool; however, it looks like only one thread is being used:
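(The stack trace originally attached here is not preserved. As a generic illustration of how a pool can end up single-threaded, not necessarily the exact Nextflow code: a `ThreadPoolExecutor` with an unbounded work queue never grows beyond its core pool size, because extra threads are only created when the queue is full.)

```groovy
import java.util.concurrent.*

// Generic Java/Groovy gotcha: with an unbounded LinkedBlockingQueue the
// queue is never "full", so the pool never grows past corePoolSize and
// every task runs on the single core thread.
def pool = new ThreadPoolExecutor(
        1,                                   // corePoolSize
        10,                                  // maxPoolSize -- never reached
        60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>()) // unbounded queue

(1..5).each { i ->
    pool.execute { println "task $i on ${Thread.currentThread().name}" }
}
pool.shutdown()
```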
Ah! That's very interesting, thanks for looking into this @pditommaso. To update on my (unhelpful) comments from Gitter, my only progress on fixing this has been isolating the behavior to this function:
Anecdotally, others have noticed that "copy to publishDir is one of the longest steps in working with a nextflow pipeline, especially when using the 'resume' function". This behavior became apparent on AWS Batch when running 100s of samples.
Based on some chatter on Twitter, it appears this issue could be solved via a few fixes to the ThreadPoolExecutor, no? :) CC @pditommaso
Yes, it looks like a matter of thread pool configuration. I'll provide a patch in the coming days.
I've uploaded a snapshot that should improve S3 uploads and possibly solve this issue. You can test it using the following command:
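(The exact command is not preserved in this transcript. Testing a Nextflow snapshot build typically follows this pattern, where the version string is an assumption and `my-pipeline.nf` is a placeholder:)

```bash
# NXF_VER pins the runtime to the snapshot build; CAPSULE_RESET forces
# the launcher to re-download dependencies for the new snapshot.
NXF_VER=0.32.0-SNAPSHOT CAPSULE_RESET=1 nextflow run my-pipeline.nf
```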
Great, thank you @pditommaso. We will test it at the earliest opportunity.
Thank you for the help on this @pditommaso. We'll give it a try soon.
This commit solves a problem with the thread pool used by the publishDir directive that was causing a low upload rate when uploading/copying multiple files to a remote storage location.
This version implements a better thread pooling strategy to speed up S3 file uploads. See nextflow-io/nextflow-s3fs@199f2b2
Hi Paolo, I tested the fix with a test that produces 200 files of 1 GB each, with: …
It looks like a problem with the S3 thread pool. This looks interesting. Also, this may be interesting to investigate. You may want to try tuning the thread pool settings using the uploadXxx options here; a sketch of where they live is shown below. If you do, please include the …
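(For reference, the uploadXxx options mentioned above live under the `aws.client` scope of `nextflow.config`. A minimal sketch; the option names come from the Nextflow documentation, while the values are purely illustrative:)

```groovy
// nextflow.config -- illustrative values, not recommendations
aws {
    client {
        uploadMaxThreads  = 16        // threads used for S3 multipart uploads
        uploadChunkSize   = '100 MB'  // size of each multipart chunk
        uploadMaxAttempts = 5         // retry attempts per failed chunk
    }
}
```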
I tried replicating it by running the same test multiple times in the last few days, and it looks like the problem was temporary. It probably makes sense to close the ticket and reopen it if we are able to reproduce it.
Ok, please open a separate issue if needed.
Thanks for the help @pditommaso |
I have been having this problem recently. In my case, I have 2 …
I tried increasing these limits but it didn't help. I think it's because these limits apply only to transfers to S3. Is there a way to improve this and make sure Nextflow publishes all output properly, no matter the file size or publish directory?
Hi @bounlu, the thread pool that publishes files has a default timeout of 12 hours. You can increase it by setting `threadPool.PublishDir.maxAwait = 24.h`, as sketched below.
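(A minimal sketch of where that setting goes; the 24-hour value is just an example:)

```groovy
// nextflow.config -- raise the publish thread pool's await timeout
// from the default of 12 hours
threadPool.PublishDir.maxAwait = 24.h
```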
I also received the "Timeout waiting for connection from pool" error recently while transferring a 4 GB file to S3. The job failed in about 15 minutes, so I doubt that `threadPool.PublishDir.maxAwait = 24.h` helps here.
This did not help. I am still having the same issue with large files. Any solution yet? |
One thing you all can try is to enable virtual threads. See this blog post for details, but here's the gist:
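(The gist block itself is not preserved here. Based on the Nextflow documentation, the idea is roughly the following; the pipeline name is a placeholder:)

```bash
# On Java 19/20 with a recent Nextflow release, virtual threads must be
# enabled explicitly via an environment variable:
export NXF_ENABLE_VIRTUAL_THREADS=true
# On Java 21 or newer, no flag should be needed.
nextflow run my-pipeline.nf   # placeholder pipeline name
```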
Then virtual threads will be enabled automatically. I have done some benchmarks and found that this feature can significantly reduce the time needed to publish files at the end of the workflow, especially when copying from S3 to S3. I haven't done any benchmarks for Google Cloud Storage, but there might be some benefit there as well; worth trying in any case.
Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing the `aws.client.maxConnections` setting, as sketched below.
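(A one-line sketch; the value shown is arbitrary:)

```groovy
// nextflow.config -- allow more concurrent HTTP connections to S3
aws.client.maxConnections = 100
```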
I read your post, and I think setting virtual threads did the trick.
Hello, I have a similar problem with the nf-core Sarek pipeline. When I start the execution of the pipeline with the following command:
I get the following error:
Will setting virtual threads solve the problem, and if yes, how should I do it?
Hello, I have the same issue with OpenJDK 22 and Nextflow 24 (so I assume virtual threads are enabled by default, but I also tried setting them explicitly). Please help.
In that case, AWS gets overwhelmed and starts throwing slow-downs. In another thread, it was suggested to decrease the max connections. Apart from that, I just realized we run Nextflow with these tweaks:
Could that cause the "Timeout waiting for connection from pool" error? (We have thousands of processes running.) Asking because testing it myself is a bit expensive :)
If your max connections setting is high and Nextflow isn't running out of memory, the bottleneck is likely the network connection. I would try using a VM with more network bandwidth.
You mean the JVM, or a host machine with more network bandwidth?
host machine |
Bug report
Expected behavior and actual behavior
When executing a workflow on AWS Batch that creates thousands of files, the S3 API calls that copy or move the outputs to the `publishDir` location should be issued from a larger thread pool to speed up the process as much as possible.

Currently, executing a workflow that creates thousands of files and uses the `publishDir` directive results in a very long wait for Nextflow to complete the copy operations across S3. For instance, in one specific case I was executing a workflow that created more than 130k files (while the number of jobs run on AWS Batch was very low, around 40). The workflow execution took a couple of hours, but after more than 5 hours Nextflow was still copying the files to the `publishDir` location.

Program output
There are no errors in the log file, all the tasks complete correctly, and the end of the Nextflow log file looks like this:

No more information is written to the log while Nextflow issues the copy commands on S3 to transfer the output files to the `publishDir` location.

Steps to reproduce the problem
Run a simple workflow on AWS Batch with a few jobs that creates at least 10k files and uses the `publishDir` directive to copy all the output files to a final S3 location.

Environment
Version: 0.31.1 build 4886
Modified: 07-08-2018 15:53 UTC (17:53 CEST)
System: Mac OS X 10.11.6
Runtime: Groovy 2.4.15 on OpenJDK 64-Bit Server VM 1.8.0_121-b15
Encoding: UTF-8 (UTF-8)