bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156

jrandall · 2023-11-28T13:37:36Z

Description of the bug

The bclconvert module passes --bcl-num-parallel-tiles ${task.cpus} to bcl-convert

demultiplex/modules/nf-core/bclconvert/main.nf

Line 65 in 67b8465

--bcl-num-parallel-tiles ${task.cpus}

The behaviour of this argument --bcl-num-parallel-tiles seems to be somewhat underdocumented. The illumina documentation merely says "Number of tiles processed in parallel" in the entry under the "Software Performance" heading at: https://support-docs.illumina.com/SW/DRAGEN_v40/Content/SW/DRAGEN/CommandLineOptions.htm

However, some experimentation suggests that setting --bcl-num-parallel-tiles to the number of CPUs on the machine is not a good idea. There are several other options that also control the number of threads, and the default for all of these is "Determined dynamically" according to illumina.

They also go on to say:

"It is recommended to only adjust CPU threads when reducing cores used on a shared machine. The total number of CPU-intensive threads used will be: --bcl-num-parallel-tiles * --bcl-num-conversion-threads + --bcl-num-compression-threads + --bcl-num-decompression-threads."

The online help from the bcl-convert CLI gives a bit more detail:

$ docker run quay.io/nf-core/bclconvert:4.2.4 bcl-convert -h
bcl-convert Version 00.000.000.4.2.4
Copyright (c) 2014-2022 Illumina, Inc.

Run BCL Conversion (BCL directory to *.fastq.gz)
  bcl-convert --bcl-input-directory <BCL_ROOT_DIR> --output-directory <PATH> [options]

Options:
  -h [ --help ]                          Print this help message
  -V [ --version ]                       Print the version and exit
  --output-directory arg                 Output BCL directory for BCL conversion (must be specified)
  -f [ --force ]                         Force: allow destination diretory to already exist
  --bcl-input-directory arg              Input BCL directory for BCL conversion (must be specified)
  --sample-sheet arg                     Path to SampleSheet.csv file (default searched for in --bcl-input-directory)
  --bcl-only-lane arg                    Convert only specified lane number (default all lanes)
  --strict-mode arg                      Abort if any files are missing (false by default)
  --first-tile-only arg                  Only convert first tile of input (for testing & debugging)
  --tiles arg                            Process only a subset of tiles by a regular expression
  --exclude-tiles arg                    Exclude set of tiles by a regular expression
  --bcl-sampleproject-subdirectories arg Output to subdirectories based upon sample sheet 'Sample_Project' column
  --sample-name-column-enabled arg       Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
  --fastq-gzip-compression-level arg     Set fastq output compression level 0-9 (default 1)
  --shared-thread-odirect-output arg     Use linux native asynchronous io (io_submit) for file output (Default=false)
  --bcl-num-parallel-tiles arg           # of tiles to process in parallel (default 1)
  --bcl-num-conversion-threads arg       # of threads for conversion (per tile, default # cpu threads)
  --bcl-num-compression-threads arg      # of threads for fastq.gz output compression (per tile, default # cpu threads,
                                         or HW+12)
  --bcl-num-decompression-threads arg    # of threads for bcl/cbcl input decompression (per tile, default half # cpu
                                         threads, or HW+8. Only applies when preloading files)
  --bcl-only-matched-reads arg           For pure BCL conversion, do not output files for 'Undetermined' [unmatched]
                                         reads (output by default)
  --run-info arg                         Path to RunInfo.xml file (default root of BCL input directory)
  --no-lane-splitting arg                Do not split FASTQ file by lane (false by default)
  --num-unknown-barcodes-reported arg    # of Top Unknown Barcodes to output (1000 by default)
  --bcl-validate-sample-sheet-only arg   Only validate RunInfo.xml & SampleSheet files (produce no FASTQ files)
  --output-legacy-stats arg              Also output stats in legacy (bcl2fastq2) format (false by default)
  --no-sample-sheet arg                  Enable legacy no-sample-sheet operation (No demux or trimming. No settings
                                         supported. False by default, not recommended

I don't quite get what the HW means here, but I think the intent is probably there will be at least 12 compression and 8 decompression threads even on a single core machine.

My observations of the behaviour of bcl-convert under various options is consistent with this documentation. There seem to be a fixed number of compression and decompressions threads globally, and then a set of conversion threads for each tile (the on-line help seems to say these are per tile which disagrees with the documentation, so I'm not entirely sure which is correct). The defaults on an n-core machine are to do one tile at a time and have n conversion threads, n compression threads, and n/2 decompression threads. If one runs bcl-convert without any --bcl-num-parallel-tiles option or with --bcl-num-parallel-tiles 1, bcl-convert is already able to make good use of all cores on large machines. E.g. on a 64 core machine it seems that it operates 64 conversion threads, 64 compression threads, and 32 decompression threads, and the machine stays well utilized during operation (though of course I would expect this to depend somewhat on storage throughput).

On a 64 core machine, the current nf-core module sets --bcl-num-parallel-tiles 64 which causes a total of 4192 threads to contend for CPU time and I/O. These appear to consist of:

4096 conversion threads (64 parallel tiles * 64 conversion threads per tile)
64 compression threads
32 decompression threads

There are two issues here.

One is that 4192 threads is an unnecessarily large number of threads relative to a 64 core machine, the overhead of which is likely to be non-trivial. However, on a modern linux system, overloading CPU by a few orders of magnitude like this would not be catastrophically bad if it was purely computational work. Presumably, though, each of these conversion threads are doing independent I/O and this is likely to result in somewhat dramatic reduction in throughput depending on on how well the underlying storage can cope with so many open file handles and independent read positions.

The other, and I believe much more major issue, is that the compression and decompression threads do not seem to be automatically scaled with the number of conversion threads. As a result, system utilisation in this scenario becomes abysmally low. It seems the conversion threads spend nearly all of their time blocked waiting on access to one of the compression or decompression threads, and the result is that bcl-convert runs much more slowly than with --bcl-num-parallel-tiles 1. I'm not completely sure that there isn't some other fixed resource that they are waiting for (such as some kind of counting semaphore), but in any case the result of running on a machine with many cores and --bcl-num-parallel-tiles set to the number of cores is to run much much more slowly than with --bcl-num-parallel-tiles 1.

My recommendation would be to eliminate the setting of --bcl-num-parallel-tiles entirely, and revert to the default behaviour of bcl-convert (which is to process a single tile at a time). This is likely to result in performance improvements in pretty much every scenario where the process is intended to use all detected CPU cores. If it is essential that nf-core actively limit the number of threads, then all four relevant settings (--bcl-num-parallel-tiles, --bcl-num-conversion-threads, --bcl-num-compression-threads, --bcl-num-decompression-threads) should be explicitly set to sensible values.

Command used and terminal output

No response

Relevant files

No response

System information

No response

The text was updated successfully, but these errors were encountered:

jrandall · 2023-11-28T13:45:31Z

It is also worth noting that setting --bcl-num-parallel-tiles to a non-default value also dramatically increases memory usage. The documented minimum system requirement of 64GiB RAM seems to only be valid with the default option --bcl-num-parallel-tiles 1

edmundmiller · 2023-11-28T19:14:11Z

Awesome, thanks for the extremely detailed report and digging into this! I can move this issue over to nf-core/modules if you'd like, where-ever we think there would be more visability.

As for a PR I think if it's really simple, could you make a PR to nf-core modules and then a PR bumping the module in demultiplex?

berguner · 2023-11-29T10:24:23Z

Wow, that's a very deep dive! I managed to demux 150PE S4 flowcells on AWS Batch with by the config below. The execution times were around 1 hour for each lane. However, the jobs took 6-7 hours in total because the whole flowcell had to be copied to the instance. It would be amazing if we could copy the BCL files of only the relevant lane.

process {
        withName:BCLCONVERT {
                cpus = 8
                memory = 120.GB
        }
}

Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures nf-core/demultiplex#156

FelixKrueger · 2023-11-29T15:35:33Z

Hi @emiller88, if you think this would be better mentioned over at modules/bclconvert please go ahead and move it. I have just created a PR for this. Cheers

* remove parameter --bcl-num-parallel-tiles Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures nf-core/demultiplex#156 * style: Remove environment.yml * style: Fix license --------- Co-authored-by: Edmund Miller <[email protected]>

edmundmiller · 2023-11-30T20:05:28Z

Reopening until @FelixKrueger makes a PR to bump the module in the pipeline 😉

Fixes nf-core#156

jrandall added the bug Something isn't working label Nov 28, 2023

FelixKrueger added a commit to FelixKrueger/nf-core-modules that referenced this issue Nov 29, 2023

remove parameter --bcl-num-parallel-tiles

235e316

Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures nf-core/demultiplex#156

FelixKrueger mentioned this issue Nov 29, 2023

remove parameter --bcl-num-parallel-tiles nf-core/modules#4496

Merged

1 task

FelixKrueger closed this as completed in nf-core/modules#4496 Nov 30, 2023

edmundmiller assigned FelixKrueger Nov 30, 2023

edmundmiller added this to the 1.4.0 milestone Nov 30, 2023

edmundmiller reopened this Nov 30, 2023

maxulysse mentioned this issue Dec 11, 2023

Fix bcl2fastq and bclconvert publishDir #157

Merged

5 tasks

Aratz added a commit to Aratz/nf-core-demultiplex that referenced this issue Dec 12, 2023

Update all modules

1bc500a

Fixes nf-core#156

edmundmiller linked a pull request Dec 12, 2023 that will close this issue

Update all modules #158

Merged

4 tasks

edmundmiller closed this as completed Dec 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156

bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156

jrandall commented Nov 28, 2023 •

edited by ewels

Loading

jrandall commented Nov 28, 2023

edmundmiller commented Nov 28, 2023

berguner commented Nov 29, 2023

FelixKrueger commented Nov 29, 2023

edmundmiller commented Nov 30, 2023

bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156

bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156

Comments

jrandall commented Nov 28, 2023 • edited by ewels Loading

Description of the bug

Command used and terminal output

Relevant files

System information

jrandall commented Nov 28, 2023

edmundmiller commented Nov 28, 2023

berguner commented Nov 29, 2023

FelixKrueger commented Nov 29, 2023

edmundmiller commented Nov 30, 2023

jrandall commented Nov 28, 2023 •

edited by ewels

Loading