bcl-convert setting of --bcl-num-parallel-tiles causes very poor performance #156
Comments
Awesome, thanks for the extremely detailed report and digging into this! I can move this issue over to nf-core/modules if you'd like, wherever we think there would be more visibility. As for a PR, I think if it's really simple, could you make a PR to nf-core/modules and then a PR bumping the module in the pipeline?
Wow, that's a very deep dive! I managed to demux 150PE S4 flowcells on AWS Batch with the config below. The execution times were around 1 hour for each lane. However, the jobs took 6-7 hours in total because the whole flowcell had to be copied to the instance. It would be amazing if we could copy the BCL files of only the relevant lane.
Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures nf-core/demultiplex#156
Hi @emiller88, if you think this would be better mentioned over at modules/bclconvert please go ahead and move it. I have just created a PR for this. Cheers
* remove parameter --bcl-num-parallel-tiles

  Remove parameter `--bcl-num-parallel-tiles` as it causes poor performance or run failures nf-core/demultiplex#156

* style: Remove environment.yml
* style: Fix license

Co-authored-by: Edmund Miller <[email protected]>
Reopening until @FelixKrueger makes a PR to bump the module in the pipeline 😉
Description of the bug
The bclconvert module passes `--bcl-num-parallel-tiles ${task.cpus}` to `bcl-convert`:

demultiplex/modules/nf-core/bclconvert/main.nf (line 65 in 67b8465)
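For context, on a 64-core node the rendered command ends up looking roughly like the sketch below. The input/output paths and sample sheet are placeholders rather than the module's exact command line; only the tiles flag is the point here.

```bash
# Rough sketch of the command the module renders on a 64-core node.
# Paths and sample sheet are placeholders, not the module's real arguments.
bcl-convert \
    --bcl-input-directory /path/to/run_folder \
    --output-directory ./demux_output \
    --sample-sheet /path/to/SampleSheet.csv \
    --bcl-num-parallel-tiles 64
```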
The behaviour of the `--bcl-num-parallel-tiles` argument seems to be somewhat underdocumented. The Illumina documentation merely says "Number of tiles processed in parallel" in the entry under the "Software Performance" heading at: https://support-docs.illumina.com/SW/DRAGEN_v40/Content/SW/DRAGEN/CommandLineOptions.htm

However, some experimentation suggests that setting `--bcl-num-parallel-tiles` to the number of CPUs on the machine is not a good idea. There are several other options that also control the number of threads, and the default for all of these is "Determined dynamically" according to Illumina. They also go on to say:
The online help from the `bcl-convert` CLI gives a bit more detail:

I don't quite get what the HW means here, but I think the intent is probably that there will be at least 12 compression and 8 decompression threads even on a single-core machine.

My observations of the behaviour of `bcl-convert` under various options are consistent with this documentation. There seems to be a fixed number of compression and decompression threads globally, and then a set of conversion threads for each tile (the online help seems to say these are per tile, which disagrees with the documentation, so I'm not entirely sure which is correct). The defaults on an n-core machine are to process one tile at a time with n conversion threads, n compression threads, and n/2 decompression threads. If one runs `bcl-convert` without any `--bcl-num-parallel-tiles` option, or with `--bcl-num-parallel-tiles 1`, it is already able to make good use of all cores on large machines. E.g. on a 64-core machine it seems to operate 64 conversion threads, 64 compression threads, and 32 decompression threads, and the machine stays well utilized during operation (though of course I would expect this to depend somewhat on storage throughput).

On a 64-core machine, the current nf-core module sets
`--bcl-num-parallel-tiles 64`, which causes a total of 4192 threads to contend for CPU time and I/O. These appear to consist of roughly 64 tiles × 64 conversion threads (4096 in total), plus the global 64 compression and 32 decompression threads.
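For what it's worth, the total thread count is easy to check while a conversion is running; a minimal sketch, assuming exactly one bcl-convert process is present and that its process name is simply bcl-convert:

```bash
# Report the thread count of the running bcl-convert process.
pid=$(pgrep -x bcl-convert | head -n 1)
grep Threads "/proc/${pid}/status"
```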
There are two issues here.

One is that 4192 threads is an unnecessarily large number of threads relative to a 64-core machine, and the overhead of managing them is likely to be non-trivial. However, on a modern Linux system, overloading the CPU by well over an order of magnitude like this would not be catastrophically bad if it were purely computational work. Presumably, though, each of these conversion threads is doing independent I/O, and this is likely to result in a somewhat dramatic reduction in throughput depending on how well the underlying storage copes with so many open file handles and independent read positions.
The other, and I believe much more major, issue is that the compression and decompression threads do not seem to be scaled automatically with the number of conversion threads. As a result, system utilisation in this scenario becomes abysmally low. It seems the conversion threads spend nearly all of their time blocked waiting for access to one of the compression or decompression threads, and the result is that `bcl-convert` runs much more slowly than with `--bcl-num-parallel-tiles 1`. I'm not completely sure that there isn't some other fixed resource that they are waiting for (such as some kind of counting semaphore), but in any case the result of running on a machine with many cores and `--bcl-num-parallel-tiles` set to the number of cores is to run much, much more slowly than with `--bcl-num-parallel-tiles 1`.
My recommendation would be to eliminate the setting of `--bcl-num-parallel-tiles` entirely and revert to the default behaviour of `bcl-convert` (which is to process a single tile at a time). This is likely to result in performance improvements in pretty much every scenario where the process is intended to use all detected CPU cores. If it is essential that nf-core actively limit the number of threads, then all four relevant settings (`--bcl-num-parallel-tiles`, `--bcl-num-conversion-threads`, `--bcl-num-compression-threads`, `--bcl-num-decompression-threads`) should be explicitly set to sensible values.
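Purely as an illustration, an invocation that pins all four settings to mirror the dynamic defaults described above for a 64-core machine might look like the sketch below. The specific values are an assumption based on the observations above rather than a tested recommendation, and the paths are placeholders.

```bash
# Illustrative only: one tile at a time, with 64/64/32 conversion/compression/
# decompression threads (matching the defaults observed on a 64-core machine).
bcl-convert \
    --bcl-input-directory /path/to/run_folder \
    --output-directory ./demux_output \
    --sample-sheet /path/to/SampleSheet.csv \
    --bcl-num-parallel-tiles 1 \
    --bcl-num-conversion-threads 64 \
    --bcl-num-compression-threads 64 \
    --bcl-num-decompression-threads 32
```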
Command used and terminal output
No response
Relevant files
No response
System information
No response