
Performance issues on S3 with publishDir and a large number of output files to be copied #856

Closed
fstrozzi opened this issue Sep 5, 2018 · 32 comments


@fstrozzi
Contributor

fstrozzi commented Sep 5, 2018

Bug report

Expected behavior and actual behavior

When executing a workflow on AWS Batch that creates thousands of output files, the S3 API calls that copy or move the outputs to the publishDir location should be issued from a larger thread pool, so that the process is sped up as much as possible.

Currently, when executing a workflow that creates thousands of files and uses the publishDir directive, Nextflow takes a very long time to complete the copy operations across S3.

For instance, in one specific case I was executing a workflow that created more than 130k files (while the number of jobs run on AWS Batch was quite low, around 40). The workflow execution took a couple of hours, but after more than 5 hours Nextflow was still copying the files to the publishDir location.

Program output

There are no errors in the log file, all the tasks complete correctly, and the end of the Nextflow log file looks like this:

Sep-05 11:46:28.395 [main] DEBUG nextflow.Session - Session await > all process finished
Sep-05 11:46:28.425 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: awsbatch)
Sep-05 11:46:28.425 [main] DEBUG nextflow.Session - Session await > all barriers passed

No more information is written to the log while Nextflow issues the copy commands on S3 to transfer the output files to the publishDir location.

Steps to reproduce the problem

Run a simple workflow on AWS Batch with a few jobs that create at least 10k files, using the publishDir directive to copy all the output files to a final S3 location.
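For illustration, a minimal workflow along these lines could look like the sketch below (the process name, file count and bucket path are placeholders, not the actual pipeline that triggered this report):

process MANY_SMALL_FILES {
  // publish everything to an S3 prefix; copying is what triggers the S3 transfer calls
  publishDir "s3://my-bucket/results/many_files", mode: 'copy'

  output:
    file '*.txt' into many_files_ch

  script:
  """
  # emit 10k small files from a single task
  for i in \$(seq 1 10000); do
    head -c 4096 /dev/zero > \$i.txt
  done
  """
}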

Environment

Version: 0.31.1 build 4886
Modified: 07-08-2018 15:53 UTC (17:53 CEST)
System: Mac OS X 10.11.6
Runtime: Groovy 2.4.15 on OpenJDK 64-Bit Server VM 1.8.0_121-b15
Encoding: UTF-8 (UTF-8)

@pditommaso
Member

It would be interesting to know the granularity (size) of the produced files here, since it could also affect the performance of the multipart uploads.

@fstrozzi
Contributor Author

fstrozzi commented Sep 7, 2018

In this specific case the file sizes were not big; on average the files were a few MB each. So I don't think it's a multipart upload issue, it's probably more an issue related to the number of API calls per second.

@pditommaso
Member

Provided the target of the publishDir operation is an S3 bucket, the operation should be fast, being a copy from one S3 bucket path to another.

Things to verify:

  1. that the S3 client makes proper use of the COPY API call
  2. whether the operation could be sped up for many files by mimicking the aws cli sync command. However, it looks like there is no specific API for a sync operation.

@pditommaso
Member

Maybe a solution for moving files across buckets quickly?

@tbugfinder
Contributor

@fstrozzi Do you have an example nf script available?

@sivkovic
Contributor

sivkovic commented Jun 17, 2019

@tbugfinder I created this process to simulate the issue. You can probably reduce the number of generated files.

process CreateFiles {
  container = "ubuntu:latest"
  publishDir "${params.outDir}/test_lot_of_files" // params.outDir needs to be s3 path
  
  output:
    file '*.txt' into output_files

  script:

  """
  time for x in {1..1000}; do
    dd if=/dev/zero ibs=4k count=1 of=\$x.txt 2>/dev/null
  done
  """
}

Update:
This test with 10,000 files of 4 kB each needed an additional 32 minutes after the process completed to move them to the publishDir location.

@pditommaso
Member

I'm looking at this. In principle, the uploads should be managed in parallel using a thread pool; however, it looks like only one thread is used:

Jun-24 22:35:32.071 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file1.txt -[COPY]-> /cbcrg-eu/work/file1.txt
Jun-24 22:35:38.092 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file2.txt -[COPY]-> /cbcrg-eu/work/file2.txt
Jun-24 22:35:39.788 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file3.txt -[COPY]-> /cbcrg-eu/work/file3.txt
Jun-24 22:35:41.126 [pool-3-thread-1] TRACE nextflow.processor.PublishDir - publishing file: /Users/pditommaso/Projects/nextflow/work/6c/36771a07be014f2b6517cc97bf7cdd/file4.txt -[COPY]-> /cbcrg-eu/work/file4.txt
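For reference, the intended behaviour is essentially what a fixed-size pool gives you. Below is a minimal standalone Groovy sketch of the idea, not Nextflow's actual implementation; the pool size and directory names are made up:

import java.nio.file.Files
import java.nio.file.Paths
import java.nio.file.StandardCopyOption
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// copy many files concurrently instead of serially on a single thread
def pool   = Executors.newFixedThreadPool(10)     // arbitrary pool size
def target = Paths.get('published-dir')           // placeholder destination (must already exist)

new File('work-outputs').eachFile { f ->          // placeholder source directory
    pool.submit {
        Files.copy(f.toPath(), target.resolve(f.name), StandardCopyOption.REPLACE_EXISTING)
    }
}

pool.shutdown()
pool.awaitTermination(1, TimeUnit.HOURS)          // wait for all pending copies to finish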

@evanbiederstedt

I'm looking at this. In principle, the uploads should be managed in parallel using a thread pool; however, it looks like only one thread is used:

Ah! That's very interesting. Thanks for looking into this @pditommaso

To update on my (unhelpful) comments from Gitter, my only progress on this has been isolating the behavior to this function:

log.trace "publishing file: $source -[$mode]-> $destination"

Anecdotally, others have noticed that "copy to publishDir is one of the longest steps in working with a nextflow pipeline, especially when using the 'resume' function". This behavior became apparent on AWS Batch when running hundreds of samples.

@evanbiederstedt

Based on some chatter on Twitter, it appears this issue could be solved via a few fixes to the ThreadPoolExecutor, no? :)

CC @pditommaso

@pditommaso
Member

Yes, it looks like a matter of thread pool configuration. I'll provide a patch in the next few days.

@pditommaso
Member

I've uploaded a snapshot that should improve S3 uploads and possibly solve this issue. You can test it using the following command:

NXF_VER=19.07.0-SNAPSHOT nextflow run .. etc

@fstrozzi
Contributor Author

fstrozzi commented Jul 4, 2019

Great, thank you @pditommaso. We will test it at the earliest opportunity.

@evanbiederstedt

Thank you for the help on this @pditommaso

We’ll give it a try soon

pditommaso added a commit that referenced this issue Jul 9, 2019
This commit solves a problem with the thread pool used
by the publishDir directive that was causing a low upload
rate when uploading/copying multiple files to a remote
storage location.
pditommaso added a commit that referenced this issue Jul 9, 2019
This version implements a better thread pooling strategy
to speed up S3 file uploads.

See nextflow-io/nextflow-s3fs@199f2b2
@pditommaso pditommaso added this to the v19.07.0 milestone Jul 9, 2019
@sivkovic
Contributor

Hi Paolo,

I tested the fix with a test that produces 200 files of 1 GB each, using 19.07.0. The speed-up is significant (00:01:56.94 compared to 00:21:24.043), but I encountered a problem: with 19.07.0 I'm repeatedly getting the error below, and some files failed to be published.

Jul-11 17:42:00.249 [FileTransfer-thread-29] WARN  nextflow.processor.PublishDir - Failed to publish file: /bucket/workdir/26/7461714aaa6315b31da02451a66ea9/34.2.txt; to: /bucket/output/test_lot_of_files/34.2.txt [copy] -- See log file for details
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4914)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4860)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1320)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1294)
        at com.upplication.s3fs.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:181)
        at com.upplication.s3fs.S3FileSystemProvider.copy(S3FileSystemProvider.java:529)
        at nextflow.file.FileHelper.copyPath(FileHelper.groovy:852)
        at nextflow.processor.PublishDir.processFileImpl(PublishDir.groovy:283)
        at nextflow.processor.PublishDir.processFile(PublishDir.groovy:254)
        at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:240)
        at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1217)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1041)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1011)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:994)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
        at nextflow.processor.PublishDir$_apply_closure1.doCall(PublishDir.groovy:213)
        at nextflow.processor.PublishDir$_apply_closure1.call(PublishDir.groovy)
        at groovy.lang.Closure.run(Closure.java:486)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:313)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279)
        at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy22.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
        ... 35 common frames omitted

@pditommaso
Member

It looks like a problem with the S3 thread pool. This looks interesting. Also, this may be worth investigating.

You may want to try tuning the thread pool settings using the uploadXxx options here.

If you do, please include the .nextflow.log generated using the following option:

nextflow -trace com.upplication.s3fs run .. etc
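As a rough example, the tuning could look like this in nextflow.config; the values are arbitrary, and the exact uploadXxx option names and defaults should be checked against the AWS configuration docs for your Nextflow version:

// nextflow.config -- example values only
aws {
    client {
        uploadMaxThreads = 16        // threads used for S3 multipart transfers
        uploadChunkSize  = '100 MB'  // size of each multipart upload chunk
    }
}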

@pditommaso pditommaso reopened this Jul 15, 2019
@sivkovic
Contributor

I tried replicating it by running the same test multiple times over the last few days, and it looks like the problem was temporary. It probably makes sense to close the ticket and reopen it if we are able to reproduce it.

@pditommaso
Member

Ok, please open a separate issue if needed.

@evanbiederstedt

Thanks for the help @pditommaso

@bounlu
Contributor

bounlu commented Oct 26, 2023

I have been having this problem recently. In my case I have two publishDir targets: one copies to S3 and the other copies to a local server. Each file is around 100 GB and I have hundreds of them. Unfortunately, the files are not copied to the local server properly; they are either corrupted or the copies fail. Worst of all, if publishing fails, all the other tasks are immediately killed by Nextflow, including long-running jobs on AWS Batch.

ERROR ~ Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_COVERAGE2CYTOSINE (XXX)'

Caused by:
  Timeout waiting for connection from pool


 -- Check '.nextflow.log' file for details
-[nf-core/methylseq] Pipeline completed with errors-
WARN: Killing running tasks (212)

I tried increasing these limits but it didn't help. I think that's because these limits apply to transfers to S3 only.

Is there a way to improve this and make sure Nextflow publishes all outputs properly, regardless of file size or publish directory?

@bentsherman
Member

Hi @bounlu, the thread pool that publishes files has a default timeout of 12 hours. You can increase it by setting threadPool.PublishDir.maxAwait in your Nextflow configuration:

threadPool.PublishDir.maxAwait = 24.h
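If the pool size itself turns out to be the limit, there may also be a maxThreads setting for the same pool; that option name is an assumption on my part, so verify it before relying on it:

// nextflow.config -- maxAwait is the setting mentioned above; maxThreads is an
// assumed sibling option for the same PublishDir pool (verify before relying on it)
threadPool.PublishDir.maxAwait   = 24.h
threadPool.PublishDir.maxThreads = 20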

@xmeng

xmeng commented Oct 27, 2023

I received the "Timeout waiting for connection from pool" error recently too, while transferring a 4 GB file to S3. The job failed after about 15 minutes, so I doubt that "threadPool.PublishDir.maxAwait = 24.h" would help here.

@bounlu
Contributor

bounlu commented Feb 7, 2024

Hi @bounlu, the thread pool that publishes files has a default timeout of 12 hours. You can increase it by setting threadPool.PublishDir.maxAwait in your Nextflow configuration:

threadPool.PublishDir.maxAwait = 24.h

This did not help. I am still having the same issue with large files. Any solution yet?

@bentsherman
Member

One thing you all can try is enabling virtual threads. See this blog post for details, but here's the gist:

  • Install Java 21
  • Install Nextflow 23.10 or later

Then virtual threads will be enabled automatically. I have done some benchmarks and found that this feature can significantly reduce the time to publish files at the end of the workflow, especially when copying from S3 to S3. I haven't benchmarked Google Cloud Storage, but there might be some benefit there too, so it's worth trying in any case.

@bentsherman
Member

Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing aws.client.maxConnections; I believe the default is 50.
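For example, in nextflow.config (the value is arbitrary; scale it to the number of files you publish concurrently):

// nextflow.config -- enlarge the S3 client HTTP connection pool (default reportedly 50)
aws.client.maxConnections = 200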

@bounlu
Contributor

bounlu commented Feb 8, 2024

I read your post, and I think setting virtual threads did the trick:

export NXF_ENABLE_VIRTUAL_THREADS="true"

@zisis86

zisis86 commented Mar 27, 2024

Hello, I have a similar problem with the nf-core/sarek pipeline. When I start the pipeline with the following command:

Desktop % nextflow run nf-core/sarek -r 3.4.0 -profile docker --input AC1/vcf_single_AC1.csv --outdir testResutsAC1_new --step annotate --tools snpeff --max_memory '8.GB'

I get the following error:

ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_KNOWN_INDELS (1)'

Caused by:
  Timeout waiting for connection from pool

Will setting virtual threads solve the problem, and if so, how should I do it?

@legezam

legezam commented Jun 11, 2024

Hello,

I have the same issue with OpenJDK 22 and Nextflow 24 (so I assume virtual threads are enabled by default, but I also tried enabling them explicitly).

Please help.

@bentsherman
Member

Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing aws.client.maxConnections; I believe the default is 50.

@legezam

legezam commented Jun 12, 2024

Regarding the "Timeout waiting for connection from pool" error, you should be able to avoid it by increasing aws.client.maxConnections; I believe the default is 50.

In that case, AWS gets overwhelmed and starts returning SlowDown (throttling) errors. In another thread, it was suggested to decrease maxConnections instead.

Apart from that, I just realized we run Nextflow with these tweaks:

export NFX_OPTS='-Xms500m -Xmx1024m'

Could that cause the "Timeout waiting for connection from pool" error? (We have thousands of processes running.) Asking because testing it myself is a bit expensive :)

@bentsherman
Member

If your max connections setting is high and Nextflow isn't running out of memory, the bottleneck is likely the network connection. I would try using a VM with more network bandwidth.

@legezam

legezam commented Jun 12, 2024

If your max connections setting is high and Nextflow isn't running out of memory, the bottleneck is likely the network connection. I would try using a VM with more network bandwidth.

Do you mean the JVM, or a host machine with more network bandwidth?

@bentsherman
Member

host machine
