
superstr taking ~6 hours to process an 80GB BAM file #20

Open
chrisclarkson opened this issue Oct 13, 2023 · 3 comments

@chrisclarkson

Hello,
Thank you for making this software available!
I downloaded your software a couple of months ago and have been trying it out. I have ~8000 WGS BAM files that I would like to process, but it is currently taking 6-8 hours per file with the following command:

superstr mode=bam -o ${BAM}_out -t 0.64 ${path}

Each genome is ~80GB.
I saw that you have some recommendations for parallelisation. However, the xargs options are not available on the cluster that I use, so do you have any recommendations for how to parallelise/speed up the process?

Is there a later version of this software that might be faster? I am working on a SLURM HPC.
Thanks again!

@lfearnley
Contributor

That does seem to be taking a lot longer than I'd expect, although I've encountered some delays on some HPC configurations.

It's a bit hard to offer immediate recommendations without knowing a little more about your configuration. One very simple thing you can do is increase the -t threshold; this will reduce the number of reads processed during repeat checking.
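
For example, keeping your command unchanged but raising the threshold (0.7 here is purely illustrative, not a recommended value; the right setting depends on your data):

superstr mode=bam -o ${BAM}_out -t 0.7 ${path}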

Are you able to share a bit more about your HPC? Is the data on HDD or SSD? Is there possibly a tape operation slowing things down a bit up front?

@chrisclarkson
Author

Hi, thank you for getting back to me!
I'm not actually sure if the data are stored on HDD or SSD.
I tried looking at our documentation, but it does not say clearly anywhere; I will send our admin a link to this post and hopefully I can clarify later...
The operating system is CentOS.
The command lsblk --output NAME,TYPE,ROTA indicates that the HPC has a mixture of both.
I am trying to parallelize across my BAM files as follows:
[screenshot of the bsub submission command]

Does this help?

Thanks again

@lfearnley
Contributor

No problem. The thing with superSTR is that it needs to read through the BAM file completely, so the first port of call is to check the performance of the read operation. I've seen some spiky performance on network-attached storage under heavy load, so that's always a possibility.
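
A quick way to sanity-check raw sequential read speed on one file (sample.bam is a placeholder name, and bear in mind that filesystem caching can flatter a second run):

# Rough sequential-read benchmark; GNU dd reports throughput when it finishes.
time dd if=sample.bam of=/dev/null bs=1M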

You'd need to cd to the directory, run pwd -P to get the physical path, then check that path against the lsblk output. If the data is on network-attached storage, which is common on an HPC, that check is less likely to be useful.
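
Roughly (the directory path below is a placeholder):

cd /path/to/bams                 # wherever the BAM files live
pwd -P                           # physical, symlink-resolved path
df -h "$(pwd -P)"                # shows which filesystem/mount backs that path
lsblk --output NAME,TYPE,ROTA    # ROTA=1 means a rotational (HDD) device, 0 means SSD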

I'm less familiar with bsub; that's an IBM LSF scheduler command rather than SLURM, isn't it? I'll have a look at the manual and see what I can work out from here, but my initial impression is that if you're running one superSTR command per job, then the resource specification there is too high; you only need 1 CPU (-n 1). This should allow the scheduler to run more jobs if you're subject to a CPU limit, and increase your total throughput on the system by a factor of 6.

From what I can tell, the -M option is specifying 72MB of memory, which should be enough but can probably be bumped a bit higher (say, to 100000), because you're unlikely to be impacted by memory demand on the scheduler; I'm less certain about the rusage specification.
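
Putting those suggestions together, a per-file submission might look something like the sketch below. It assumes one BAM per job, the log filename is made up, and the units for -M and rusage depend on your cluster's LSF_UNIT_FOR_LIMITS setting, so check with your admins:

# Hypothetical one-job-per-BAM LSF submission: 1 CPU, memory limit bumped to 100000.
bsub -n 1 -M 100000 -R "rusage[mem=100000]" \
     -o "${BAM}_superstr.log" \
     "superstr mode=bam -o ${BAM}_out -t 0.64 ${path}"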
