
Performance issues on S3 with publishDir and a large number of output files to be copied #856

Closed
fstrozzi opened this issue Sep 5, 2018 · 32 comments


@fstrozzi
Contributor

fstrozzi commented Sep 5, 2018

Bug report

Expected behavior and actual behavior

When executing a workflow on AWS Batch that creates thousands of output files, the S3 API calls that copy or move the outputs to the publishDir location should be issued from a larger thread pool, so that the process is sped up as much as possible.

Currently, when executing a workflow that creates thousands of files and uses the publishDir directive, Nextflow takes a very long time to complete the copy operations across S3.

For instance, in one specific case I was executing a workflow that created more than 130k files (while the number of jobs run on AWS Batch was quite low, around 40). The workflow execution took a couple of hours, but after more than 5 hours Nextflow was still copying the files to the publishDir location.

Program output

There are no errors in the log file, all the tasks complete correctly, and the end of the Nextflow log file looks like this:

Sep-05 11:46:28.395 [main] DEBUG nextflow.Session - Session await > all process finished
Sep-05 11:46:28.425 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: awsbatch)
Sep-05 11:46:28.425 [main] DEBUG nextflow.Session - Session await > all barriers passed

No more information is written to the log while Nextflow issues the copy commands on S3 to transfer the output files to the publishDir location.

Steps to reproduce the problem

Run a simple workflow on AWS Batch with a few jobs that create at least 10k files, using the publishDir directive to copy all the output files to a final S3 location.
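For illustration, a minimal workflow along these lines could look like the sketch below (the process name, file count and bucket path are placeholders, not the actual pipeline that triggered this report):

process MANY_SMALL_FILES {
  // publish everything to an S3 prefix; copying is what triggers the S3 transfer calls
  publishDir "s3://my-bucket/results/many_files", mode: 'copy'

  output:
    file '*.txt' into many_files_ch

  script:
  """
  # emit 10k small files from a single task
  for i in \$(seq 1 10000); do
    head -c 4096 /dev/zero > \$i.txt
  done
  """
}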

Environment

Version: 0.31.1 build 4886
Modified: 07-08-2018 15:53 UTC (17:53 CEST)
System: Mac OS X 10.11.6
Runtime: Groovy 2.4.15 on OpenJDK 64-Bit Server VM 1.8.0_121-b15
Encoding: UTF-8 (UTF-8)

@pditommaso
Member

It would be interesting to know the granularity (size) of the produced files here, since it could also affect the performance of the multipart uploads.

@fstrozzi
Contributor Author

fstrozzi commented Sep 7, 2018

In this specific case the file sizes were not big; on average the files were a few MB each. So I don't think it's a multipart upload issue, it's probably more an issue related to the number of API calls per second.

@pditommaso
Member

Provided the target of the publishDir operation is an S3 bucket, the operation should be fast, being a copy from one S3 bucket path to another.

Things to verify:

  1. that the S3 client makes proper use of the COPY API call
  2. whether the operation could be sped up for many files by mimicking the aws cli sync command. However, it looks like there is no specific API for a sync operation.

@pditommaso
Member

Maybe a solution for moving files across buckets quickly?

@tbugfinder
Contributor

@fstrozzi Do you have an example nf script available?

@sivkovic
Contributor

sivkovic commented Jun 17, 2019

@tbugfinder I created this process to simulate the issue. You can probably reduce the number of generated files.

process CreateFiles {
  container = "ubuntu:latest"
  publishDir "${params.outDir}/test_lot_of_files" // params.outDir needs to be s3 path
  
  output:
    file '*.txt' into output_files

  script:

  """
  time for x in {1..1000}; do
    dd if=/dev/zero ibs=4k count=1 of=\$x.txt 2>/dev/null
  done
  """
}

Update:
This test with 10,000 files of 4 kB each needed an additional 32 minutes after the process completed to move them to the publishDir location.

@pditommaso
Member

I'm looking at this. In principle, the uploads should be managed in parallel using a thread pool; however, it looks like only one thread is used:

Jun-24 22:35:32.071 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file1.txt -[COPY]-> /cbcrg-eu/work/file1.txt
Jun-24 22:35:38.092 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file2.txt -[COPY]-> /cbcrg-eu/work/file2.txt
Jun-24 22:35:39.788 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file3.txt -[COPY]-> /cbcrg-eu/work/file3.txt
Jun-24 22:35:41.126 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file4.txt -[COPY]-> /cbcrg-eu/work/file4.txt
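For reference, the intended behaviour is essentially what a fixed-size pool gives you. Below is a minimal standalone Groovy sketch of the idea, not Nextflow's actual implementation; the pool size and directory names are made up:

import java.nio.file.Files
import java.nio.file.Paths
import java.nio.file.StandardCopyOption
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// copy many files concurrently instead of serially on a single thread
def pool   = Executors.newFixedThreadPool(10)     // arbitrary pool size
def target = Paths.get('published-dir')           // placeholder destination (must already exist)

new File('work-outputs').eachFile { f ->          // placeholder source directory
    pool.submit {
        Files.copy(f.toPath(), target.resolve(f.name), StandardCopyOption.REPLACE_EXISTING)
    }
}

pool.shutdown()
pool.awaitTermination(1, TimeUnit.HOURS)          // wait for all pending copies to finish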

@evanbiederstedt

I'm looking at this. In principle, the uploads should be managed in parallel using a thread pool; however, it looks like only one thread is used:

Ah! That's very interesting. Thanks for looking into this @pditommaso

To update on my (unhelpful) comments from Gitter, my only progress on this has been isolating the behavior to this function:

log.trace "publishing file: $source -[$mode]-> $destination"

Anecdotally, others have noticed that "copy to publishDir is one of the longest steps in working with a nextflow pipeline, especially when using the 'resume' function". This behavior became apparent on AWS Batch when running hundreds of samples.

@evanbiederstedt

Based on some chatter on Twitter, it appears this issue could be solved via a few fixes to the ThreadPoolExecutor, no? :)

CC @pditommaso

@pditommaso
Member

Yes, it looks like a matter of thread pool configuration. I'll provide a patch in the next few days.

@pditommaso
Member

I've uploaded a snapshot that should improve S3 uploads and possibly solve this issue. You can test it using the following command:

NXF_VER=19.07.0-SNAPSHOT nextflow run .. etc

@fstrozzi
Contributor Author

fstrozzi commented Jul 4, 2019

Great, thank you @pditommaso. We will test it at the earliest opportunity.

@evanbiederstedt

Thank you for the help on this @pditommaso

We’ll give it a try soon

pditommaso added a commit that referenced this issue Jul 9, 2019
This commit solves a problem with the thread pool used
by the publishDir directive that was causing a low upload
rate when uploading/copying multiple files to a remote
storage location.
pditommaso added a commit that referenced this issue Jul 9, 2019
This version implements a better thread pooling strategy
to speed up S3 file uploads.

See nextflow-io/nextflow-s3fs@199f2b2
@pditommaso pditommaso added this to the v19.07.0 milestone Jul 9, 2019
@sivkovic
Contributor

Hi Paolo,

I tested the fix with a test that produces 200 files of 1 GB each, using 19.07.0. The speed-up is significant (00:01:56.94 compared to 00:21:24.043), but I encountered a problem: with 19.07.0 I'm repeatedly getting the error below, and some files failed to be published.

Jul-11 17:42:00.249 [FileTransfer-thread-29] WARN  nextflow.processor.PublishDir - Failed to publish file: /bucket/workdir/26/7461714aaa6315b31da02451a66ea9/34.2.txt; to: /bucket/output/test_lot_of_files/34.2.txt [copy] -- See log file for details
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4914)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4860)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1320)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1294)
        at com.upplication.s3fs.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:181)
        at com.upplication.s3fs.S3FileSystemProvider.copy(S3FileSystemProvider.java:529)
        at nextflow.file.FileHelper.copyPath(FileHelper.groovy:852)
        at nextflow.processor.PublishDir.processFileImpl(PublishDir.groovy:283)
        at nextflow.processor.PublishDir.processFile(PublishDir.groovy:254)
        at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:240)
        at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1217)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1041)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1011)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:994)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
        at nextflow.processor.PublishDir$_apply_closure1.doCall(PublishDir.groovy:213)
        at nextflow.processor.PublishDir$_apply_closure1.call(PublishDir.groovy)
        at groovy.lang.Closure.run(Closure.java:486)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:313)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279)
        at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy22.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
        ... 35 common frames omitted

@pditommaso
Member

It looks like a problem with the S3 thread pool. This looks interesting. Also, this may be worth investigating.

You may want to try tuning the thread pool settings using the uploadXxx options here.

If you do, please include the .nextflow.log generated using the following option:

nextflow -trace com.upplication.s3fs run .. etc
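As a rough example, the tuning could look like this in nextflow.config; the values are arbitrary, and the exact uploadXxx option names and defaults should be checked against the AWS configuration docs for your Nextflow version:

// nextflow.config -- example values only
aws {
    client {
        uploadMaxThreads = 16        // threads used for S3 multipart transfers
        uploadChunkSize  = '100 MB'  // size of each multipart upload chunk
    }
}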

@pditommaso pditommaso reopened this Jul 15, 2019
@sivkovic
Contributor

I tried replicating it by running the same test multiple times over the last few days, and it looks like the problem was temporary. It probably makes sense to close the ticket and reopen it if we are able to reproduce it.

@pditommaso
Member

Ok, please open a separate issue if needed.

@evanbiederstedt

Thanks for the help @pditommaso

@bounlu
Contributor

bounlu commented Oct 26, 2023

I have been having this problem recently. In my case I have two publishDir targets: one copies to S3 and the other copies to a local server. Each file is around 100 GB and I have hundreds of them. Unfortunately, the files are not copied to the local server properly; they are either corrupted or the copies fail. Worst of all, if publishing fails, all the other tasks are immediately killed by Nextflow, including long-running jobs on AWS Batch.

ERROR ~ Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_COVERAGE2CYTOSINE (XXX)'

Caused by:
  Timeout waiting for connection from pool


 -- Check '.nextflow.log' file for details
-[nf-core/methylseq] Pipeline completed with errors-
WARN: Killing running tasks (212)

I tried increasing these limits but it didn't help. I think that's because these limits apply to transfers to S3 only.

Is there a way to improve this and make sure Nextflow publishes all outputs properly, regardless of file size or publish directory?

@bentsherman
Member

Hi @bounlu, the thread pool that publishes files has a default timeout of 12 hours. You can increase it by setting threadPool.PublishDir.maxAwait in your Nextflow configuration:

threadPool.PublishDir.maxAwait = 24.h
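If the pool size itself turns out to be the limit, there may also be a maxThreads setting for the same pool; that option name is an assumption on my part, so verify it before relying on it:

// nextflow.config -- maxAwait is the setting mentioned above; maxThreads is an
// assumed sibling option for the same PublishDir pool (verify before relying on it)
threadPool.PublishDir.maxAwait   = 24.h
threadPool.PublishDir.maxThreads = 20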

@xmeng

xmeng commented Oct 27, 2023

I received the "Timeout waiting for connection from pool" error recently too, while transferring a 4 GB file to S3. The job failed after about 15 minutes, so I doubt that "threadPool.PublishDir.maxAwait = 24.h" would help here.

@bounlu
Contributor

bounlu commented Feb 7, 2024

Hi @bounlu, the thread pool that publishes files has a default timeout of 12 hours. You can increase it by setting threadPool.PublishDir.maxAwait in your Nextflow configuration:

threadPool.PublishDir.maxAwait = 24.h

This did not help. I am still having the same issue with large files. Any solution yet?

@bentsherman
Member

One thing you all can try is enabling virtual threads. See this blog post for details, but here's the gist:

  • Install Java 21
  • Install Nextflow 23.10 or later

Then virtual threads will be enabled automatically. I have done some benchmarks and found that this feature can significantly reduce the time to publish files at the end of the workflow, especially when copying from S3 to S3. I haven't benchmarked Google Cloud Storage, but there might be some benefit there too, so it's worth trying in any case.

@bentsherman
Member

Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing aws.client.maxConnections; I believe the default is 50.
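For example, in nextflow.config (the value is arbitrary; scale it to the number of files you publish concurrently):

// nextflow.config -- enlarge the S3 client HTTP connection pool (default reportedly 50)
aws.client.maxConnections = 200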

@bounlu
Contributor

bounlu commented Feb 8, 2024

I read your post, and I think setting virtual threads did the trick:

export NXF_ENABLE_VIRTUAL_THREADS="true"

@zisis86

zisis86 commented Mar 27, 2024

Hello, I have a similar problem with the nf-core/sarek pipeline. When I start the pipeline with the following command:

Desktop % nextflow run nf-core/sarek -r 3.4.0 -profile docker --input AC1/vcf_single_AC1.csv --outdir testResutsAC1_new --step annotate --tools snpeff --max_memory '8.GB'

I get the following error:

ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_KNOWN_INDELS (1)'

Caused by:
  Timeout waiting for connection from pool

Will setting virtual threads solve the problem, and if so, how should I do it?

@legezam

legezam commented Jun 11, 2024

Hello,

I have the same issue with OpenJDK 22 and Nextflow 24 (so I assume virtual threads are enabled by default, but I also tried enabling them explicitly).

Please help.

@bentsherman
Member

Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing aws.client.maxConnections; I believe the default is 50.

@legezam

legezam commented Jun 12, 2024

Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing aws.client.maxConnections; I believe the default is 50.

In that case, AWS gets overwhelmed and starts returning SlowDown (throttling) errors. In another thread, it was suggested to decrease maxConnections instead.

Apart from that, I just realized we run Nextflow with these tweaks:

export NFX_OPTS='-Xms500m -Xmx1024m'

Could that cause the "Timeout waiting for connection from pool" error? (We have thousands of processes running.) Asking because testing it myself is a bit expensive :)

@bentsherman
Member

If your max connections setting is high and Nextflow isn't running out of memory, the bottleneck is likely the network connection. I would try using a VM with more network bandwidth.

@legezam

legezam commented Jun 12, 2024

If your max connections setting is high and Nextflow isn't running out of memory, the bottleneck is likely the network connection. I would try using a VM with more network bandwidth.

Do you mean the JVM, or a host machine with more network bandwidth?

@bentsherman
Member

host machine
