Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156

Closed
jrandall opened this issue Nov 28, 2023 · 5 comments · Fixed by nf-core/modules#4496 or #158
Closed
Assignees
Labels
bug Something isn't working
Milestone

Comments

@jrandall
Copy link

jrandall commented Nov 28, 2023

Description of the bug

The bclconvert module passes --bcl-num-parallel-tiles ${task.cpus} to bcl-convert

--bcl-num-parallel-tiles ${task.cpus}

The behaviour of this argument --bcl-num-parallel-tiles seems to be somewhat underdocumented. The illumina documentation merely says "Number of tiles processed in parallel" in the entry under the "Software Performance" heading at: https://support-docs.illumina.com/SW/DRAGEN_v40/Content/SW/DRAGEN/CommandLineOptions.htm

However, some experimentation suggests that setting --bcl-num-parallel-tiles to the number of CPUs on the machine is not a good idea. There are several other options that also control the number of threads, and the default for all of these is "Determined dynamically" according to illumina.

They also go on to say:

"It is recommended to only adjust CPU threads when reducing cores used on a shared machine. The total number of CPU-intensive threads used will be: --bcl-num-parallel-tiles * --bcl-num-conversion-threads + --bcl-num-compression-threads + --bcl-num-decompression-threads."

The online help from the bcl-convert CLI gives a bit more detail:

$ docker run quay.io/nf-core/bclconvert:4.2.4 bcl-convert -h
bcl-convert Version 00.000.000.4.2.4
Copyright (c) 2014-2022 Illumina, Inc.

Run BCL Conversion (BCL directory to *.fastq.gz)
  bcl-convert --bcl-input-directory <BCL_ROOT_DIR> --output-directory <PATH> [options]

Options:
  -h [ --help ]                          Print this help message
  -V [ --version ]                       Print the version and exit
  --output-directory arg                 Output BCL directory for BCL conversion (must be specified)
  -f [ --force ]                         Force: allow destination diretory to already exist
  --bcl-input-directory arg              Input BCL directory for BCL conversion (must be specified)
  --sample-sheet arg                     Path to SampleSheet.csv file (default searched for in --bcl-input-directory)
  --bcl-only-lane arg                    Convert only specified lane number (default all lanes)
  --strict-mode arg                      Abort if any files are missing (false by default)
  --first-tile-only arg                  Only convert first tile of input (for testing & debugging)
  --tiles arg                            Process only a subset of tiles by a regular expression
  --exclude-tiles arg                    Exclude set of tiles by a regular expression
  --bcl-sampleproject-subdirectories arg Output to subdirectories based upon sample sheet 'Sample_Project' column
  --sample-name-column-enabled arg       Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
  --fastq-gzip-compression-level arg     Set fastq output compression level 0-9 (default 1)
  --shared-thread-odirect-output arg     Use linux native asynchronous io (io_submit) for file output (Default=false)
  --bcl-num-parallel-tiles arg           # of tiles to process in parallel (default 1)
  --bcl-num-conversion-threads arg       # of threads for conversion (per tile, default # cpu threads)
  --bcl-num-compression-threads arg      # of threads for fastq.gz output compression (per tile, default # cpu threads,
                                         or HW+12)
  --bcl-num-decompression-threads arg    # of threads for bcl/cbcl input decompression (per tile, default half # cpu
                                         threads, or HW+8. Only applies when preloading files)
  --bcl-only-matched-reads arg           For pure BCL conversion, do not output files for 'Undetermined' [unmatched]
                                         reads (output by default)
  --run-info arg                         Path to RunInfo.xml file (default root of BCL input directory)
  --no-lane-splitting arg                Do not split FASTQ file by lane (false by default)
  --num-unknown-barcodes-reported arg    # of Top Unknown Barcodes to output (1000 by default)
  --bcl-validate-sample-sheet-only arg   Only validate RunInfo.xml & SampleSheet files (produce no FASTQ files)
  --output-legacy-stats arg              Also output stats in legacy (bcl2fastq2) format (false by default)
  --no-sample-sheet arg                  Enable legacy no-sample-sheet operation (No demux or trimming. No settings
                                         supported. False by default, not recommended

I don't quite get what the HW means here, but I think the intent is probably there will be at least 12 compression and 8 decompression threads even on a single core machine.

My observations of the behaviour of bcl-convert under various options is consistent with this documentation. There seem to be a fixed number of compression and decompressions threads globally, and then a set of conversion threads for each tile (the on-line help seems to say these are per tile which disagrees with the documentation, so I'm not entirely sure which is correct). The defaults on an n-core machine are to do one tile at a time and have n conversion threads, n compression threads, and n/2 decompression threads. If one runs bcl-convert without any --bcl-num-parallel-tiles option or with --bcl-num-parallel-tiles 1, bcl-convert is already able to make good use of all cores on large machines. E.g. on a 64 core machine it seems that it operates 64 conversion threads, 64 compression threads, and 32 decompression threads, and the machine stays well utilized during operation (though of course I would expect this to depend somewhat on storage throughput).

On a 64 core machine, the current nf-core module sets --bcl-num-parallel-tiles 64 which causes a total of 4192 threads to contend for CPU time and I/O. These appear to consist of:

  • 4096 conversion threads (64 parallel tiles * 64 conversion threads per tile)
  • 64 compression threads
  • 32 decompression threads

There are two issues here.

One is that 4192 threads is an unnecessarily large number of threads relative to a 64 core machine, the overhead of which is likely to be non-trivial. However, on a modern linux system, overloading CPU by a few orders of magnitude like this would not be catastrophically bad if it was purely computational work. Presumably, though, each of these conversion threads are doing independent I/O and this is likely to result in somewhat dramatic reduction in throughput depending on on how well the underlying storage can cope with so many open file handles and independent read positions.

The other, and I believe much more major issue, is that the compression and decompression threads do not seem to be automatically scaled with the number of conversion threads. As a result, system utilisation in this scenario becomes abysmally low. It seems the conversion threads spend nearly all of their time blocked waiting on access to one of the compression or decompression threads, and the result is that bcl-convert runs much more slowly than with --bcl-num-parallel-tiles 1. I'm not completely sure that there isn't some other fixed resource that they are waiting for (such as some kind of counting semaphore), but in any case the result of running on a machine with many cores and --bcl-num-parallel-tiles set to the number of cores is to run much much more slowly than with --bcl-num-parallel-tiles 1.

My recommendation would be to eliminate the setting of --bcl-num-parallel-tiles entirely, and revert to the default behaviour of bcl-convert (which is to process a single tile at a time). This is likely to result in performance improvements in pretty much every scenario where the process is intended to use all detected CPU cores. If it is essential that nf-core actively limit the number of threads, then all four relevant settings (--bcl-num-parallel-tiles, --bcl-num-conversion-threads, --bcl-num-compression-threads, --bcl-num-decompression-threads) should be explicitly set to sensible values.

Command used and terminal output

No response

Relevant files

No response

System information

No response

@jrandall jrandall added the bug Something isn't working label Nov 28, 2023
@jrandall
Copy link
Author

It is also worth noting that setting --bcl-num-parallel-tiles to a non-default value also dramatically increases memory usage. The documented minimum system requirement of 64GiB RAM seems to only be valid with the default option --bcl-num-parallel-tiles 1

@edmundmiller
Copy link
Collaborator

Awesome, thanks for the extremely detailed report and digging into this! I can move this issue over to nf-core/modules if you'd like, where-ever we think there would be more visability.

As for a PR I think if it's really simple, could you make a PR to nf-core modules and then a PR bumping the module in demultiplex?

@berguner
Copy link

Wow, that's a very deep dive! I managed to demux 150PE S4 flowcells on AWS Batch with by the config below. The execution times were around 1 hour for each lane. However, the jobs took 6-7 hours in total because the whole flowcell had to be copied to the instance. It would be amazing if we could copy the BCL files of only the relevant lane.

process {
        withName:BCLCONVERT {
                cpus = 8
                memory = 120.GB
        }
}

FelixKrueger added a commit to FelixKrueger/nf-core-modules that referenced this issue Nov 29, 2023
Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures

nf-core/demultiplex#156
@FelixKrueger
Copy link

Hi @emiller88, if you think this would be better mentioned over at modules/bclconvert please go ahead and move it. I have just created a PR for this. Cheers

github-merge-queue bot pushed a commit to nf-core/modules that referenced this issue Nov 30, 2023
* remove parameter --bcl-num-parallel-tiles

Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures

nf-core/demultiplex#156

* style: Remove environment.yml

* style: Fix license

---------

Co-authored-by: Edmund Miller <[email protected]>
@edmundmiller edmundmiller added this to the 1.4.0 milestone Nov 30, 2023
@edmundmiller
Copy link
Collaborator

Reopening until @FelixKrueger makes a PR to bump the module in the pipeline 😉

@edmundmiller edmundmiller reopened this Nov 30, 2023
Aratz added a commit to Aratz/nf-core-demultiplex that referenced this issue Dec 12, 2023
@edmundmiller edmundmiller linked a pull request Dec 12, 2023 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants