Update rosella to v0.5.0 #168
Conversation
…refining. update refine script
…abbed from correct file
Rosella isn't in the rosella.yaml env.
Ah, yep, sorry @AroneyS. I was debugging the changes whilst waiting for bioconda to update, so I had to remove it and use the binary on my path, and forgot to change it back. Should be good now.
Currently running a full-scale test on samples that I ran with the old Rosella refine.
Is it supposed to error at the end due to no bins left?
refine_metabat2: (log collapsed)
Similar error for refine_dastool: (log collapsed)
No error for refine_semibin: (log collapsed)
Nope, it is not meant to error out. Does it continue on anyway and produce results? Maybe we should just check that bins exist in the folder before going ahead.
Aviary finishes without a refinery error, i.e. bin_info.tsv is generated (though it does error due to SingleM).
Semibin was the only tool without any bins in it.
I think it would be best to add a catch prior to running CheckM that prevents it crashing. Just make sure there are MAGs in the folder it is being pointed to; if not, then refining is done, so run the final step and exit.
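A minimal sketch of the suggested guard, in Python. The function names, folder layout, and `.fna` extension are assumptions for illustration, not Aviary's actual code:

```python
from pathlib import Path


def bins_in(bin_dir, extension="fna"):
    """Hypothetical helper: list candidate MAG files in a folder."""
    return sorted(Path(bin_dir).glob(f"*.{extension}"))


def refine_step(bin_dir):
    """Only proceed to CheckM when bins remain; otherwise finish gracefully."""
    bins = bins_in(bin_dir)
    if not bins:
        # Nothing left to refine: skip CheckM, run the final step, exit cleanly
        return "finished"
    # ... here the real pipeline would invoke CheckM on `bins` ...
    return f"refining {len(bins)} bins"
```

The point is simply that the emptiness check happens before CheckM is ever invoked, so an empty final round ends the workflow instead of crashing it.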
Ok, so the error is because the final round has 0 genomes? In that case the results/timings are fine to compare?
I think it should largely be okay to compare; the timings would definitely be okay, since the analysis is done at that point. I'd just be worried that the final round of refining might not get added to the final bin set if it errors out.
Refinery timing (note that the previous run used 64 threads and the current run 16, though the previous run wasn't multithreaded for most of its running time): (timing table collapsed)
Hey, that looks pretty good, nice. Yeah, the checkm step has always been multithreaded, so maybe bumping down to 16 threads is causing rosella refine to run slower? It would make sense that it is most affected by that because it has the most bins compared to the other tools at this stage, right?
I don't think it's CheckM that is spending the time. This is between Rosella logging "Beginning refinement" and CheckM logging "CheckM v1.1.3". Just noticed that the number of genomes differs between "Beginning refinement of 5 MAGs" and "Identifying marker genes in 28 bins...". I used the latter in the table (# of bins in the first CheckM run). Should I be comparing based on Rosella's count?
Ah, I understand. That is odd, it shouldn't have increased the run time; confused as to what would cause that. Refine was multithreaded previously, but in a stop-start fashion, so it wasn't efficient. Is there an exceptionally large contaminated MAG? (Not the unbinned MAGs, they get skipped.)
5 is the number of initial contaminated bins (there may be others that aren't contaminated and thus don't get refined). The 28 is from rosella splitting the 5 bins into 28 bins and then CheckM checking their quality. Also, I noticed your pplacer threads are still set to 48, is this intentional?
Is it possible to post the logs for the refine_rosella step? It really shouldn't take >3 hours, so I am confused. I've added some time information in the logs @AroneyS to help you time the actual events as well. I think I also fixed the point where it was crashing for you; it should exit more gracefully now.
Previous run got 72 bins at 70%/10%. Is there anything in particular that you are interested in?
Nah, that's good enough, just as long as there wasn't some crazy drastic drop in recovered bins. Seems good to go then, right?
Was this resolved? I can rerun refinery with the new logging if that helps |
If you could just rerun with
It's currently running. Is there a way to see which bins were passed to refinement?
You can check the initial CheckM file to see which of the bins were contaminated, or the bins in the
Top files by size: the largest bin in Dastool is 14MB.
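For reference, a "top files by size" listing like the one above can be produced with a short Python sketch; the bin directory path and the idea of ranking bins this way are assumptions here, not part of the pipeline:

```python
from pathlib import Path


def largest_bins(bin_dir, top_n=10):
    """Return (size_bytes, filename) pairs for the largest files in bin_dir,
    largest first. bin_dir is a hypothetical bins folder."""
    files = [(p.stat().st_size, p.name)
             for p in Path(bin_dir).iterdir() if p.is_file()]
    return sorted(files, reverse=True)[:top_n]
```

Ranking by file size is a quick proxy for spotting the exceptionally large contaminated MAGs that were suspected of slowing down refinement.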
The
Yep, still running. Log so far: (log collapsed)
Hmm, something is not adding up. Why does it say 25 MAGs when there are only 4 valid MAGs in the folder?
Oh, I just truncated the list. There are more that are smaller.
Okay cool, are there 25 besides the unbinned files?
Refine rosella has completed again. Similar timing. (log collapsed)
While we are here, do they all have to change their bin names? It would be nice to see which ones were rescued through refinement.
All good then, just a tricky refinement process. rosella 0.5.1 and flight 1.6.2 add a max_retries flag and lower the number of default retries, which should speed things up when samples like this get processed. I'm not sure what you mean regarding bin names? Bins that are unchanged don't have any name change. I don't have time to set up new naming conventions for bins in rosella at the moment, so it will have to stay as it is for a while.
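The effect of a max_retries cap can be sketched as a bounded retry loop. Rosella itself is written in Rust, so this Python sketch only illustrates why lowering the default retry count shrinks the worst-case run time; the function names are invented for illustration:

```python
def run_with_retries(attempt, max_retries=3):
    """Call `attempt` up to max_retries times.

    Returns the first successful result; if every attempt fails,
    re-raises the last error. A lower max_retries caps how long a
    stubborn sample can keep the step busy.
    """
    last_err = None
    for _ in range(max_retries):
        try:
            return attempt()
        except RuntimeError as err:
            last_err = err  # failed attempt: loop around and retry
    raise last_err
```

With a loop like this, total time is bounded by max_retries times the cost of one attempt, which is why exposing the flag lets slow sample sets complete faster.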
Really? All of the rosella, semibin and metabat2 bins have "refined" in their name. Were they all refined? I thought it just refined those with large contamination?
Ah yep. I can see those too. Though
Flight bioconda update failed with "flight: error: unrecognized arguments: --help".
Okay, I think this is good to go. I've exposed the max-retries parameter, so if you've got a particularly slow set of samples you can lower the max-retries value and it will complete faster. I don't think I've got any other changes I want to add here. We should get this and the citations PR merged, then fix up the coverm citations when Ben sorts that out later on. Then I'll push out v0.8.3.
I'll try rerunning the refine_rosella step.
@rhysnewell refine_rosella took 3 hours again with the new update. Should I test reducing max-retries?
@AroneyS only if you're interested; if you find a bunch of samples where this step still slows down then we can revisit it in future, but I don't think we need to benchmark further. We've already got significant speed increases in the other steps for this sample, which kind of makes the one slower step look like a minor blip. I also suspect we shouldn't expect the rosella_refine step to take 3 hours each time; it should be more like the other steps for the majority of binning attempts. I think it just slowed down because rosella decided to make a few too many large bins in its output.
Sounds good. I can monitor it for the next batch of samples.
Update includes:
Testing:
Main sticking points:
Are unchanged bins being copied correctly, and is checkm data preserved?
I think they are; I've tested on small samples, but a large run would not hurt.
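A byte-for-byte spot check for the "unchanged bins copied correctly" question could look like this; the directory layout and `.fna` extension are assumptions, not the pipeline's actual structure:

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copies_intact(src_dir, dst_dir, extension="fna"):
    """True if every bin in src_dir has an identical copy in dst_dir.

    src_dir/dst_dir are hypothetical pre- and post-refinement folders.
    """
    for src in Path(src_dir).glob(f"*.{extension}"):
        dst = Path(dst_dir) / src.name
        if not dst.is_file() or file_digest(dst) != file_digest(src):
            return False
    return True
```

A checksum comparison catches both missing copies and silently truncated or altered files, which a simple filename listing would miss.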
Ensure the checkm2 yaml change was not just an issue with my conda setup
I'm pretty sure it wasn't, but it is weird that it just started installing incorrect versions when previously it had been fine. Should double-check on another machine if possible.
Large sample check
I've not run on a large sample, just the CAMI I low complexity test.
Are results improved?
I think the MAG recovery results should be improved by this update, but it would be good to see some benchmarks.