filter genomes with provided info before CheckM #179

Merged
merged 1 commit into from
Jan 11, 2023

Conversation

tanaes
Contributor

@tanaes tanaes commented Jan 11, 2023

I ran into a stumbling block when trying to run dRep on a large dataset. While most of the bins had CheckM statistics available from DAS_Tool, a minority were missing (see this issue: cmks/DAS_Tool#91).

I figured it would be nice to let dRep cross-reference the genomeInfo table with the genome input table and run CheckM only on those genomes missing from genomeInfo, rather than erroring out if any are missing.
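For illustration, the cross-referencing idea can be sketched roughly as below. This is not the code from the commit: the function name `genomes_missing_info`, the example paths, and the assumption that the `genome` column holds file basenames are all hypothetical, and plain dicts stand in for the genomeInfo table.

```python
import os


def genomes_missing_info(genome_paths, genome_info):
    """Return the input genome paths that have no row in the provided quality table.

    genome_paths: list of paths to genome FASTA files (the genome input list).
    genome_info:  list of dicts with a 'genome' key, assumed here to hold the
                  file basename, plus 'completeness' and 'contamination'.
    """
    known = {row["genome"] for row in genome_info}
    # Keep input order so downstream steps see genomes in the order given.
    return [p for p in genome_paths if os.path.basename(p) not in known]


# Hypothetical example data: two genomes with provided info, one without.
genome_paths = [
    "bins/10317.X00185754_13.fa",
    "bins/10317.X00185754_59.fa",
    "bins/hypothetical_bin_1.fa",
]
genome_info = [
    {"genome": "10317.X00185754_13.fa", "completeness": 100, "contamination": 0},
    {"genome": "10317.X00185754_59.fa", "completeness": 98, "contamination": 0},
]

missing = genomes_missing_info(genome_paths, genome_info)
if missing:
    # Only this subset would need to go through CheckM.
    print(f"Missing info on {len(missing)} genomes, running CheckM")
```

With this shape, an empty `missing` list means CheckM can be skipped entirely, and a non-empty list is the only set that needs quality estimation.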

@tanaes
Contributor Author

tanaes commented Jan 11, 2023

Example output from log:

01-11 12:11 DEBUG    Starting the dereplicate operation
01-11 12:11 INFO     ***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************
    
01-11 12:11 DEBUG    Loading work directory in filter
01-11 12:11 DEBUG    Located: /panfs/j4sanders/TMI/sn-mg-pipeline/drep_test
Datatables: []
Cluster files: []
Arguments: []
01-11 12:11 DEBUG    Validating filter arguments
01-11 12:11 INFO     Will filter the genome list
01-11 12:11 INFO     Loading genomes from a list
01-11 12:11 INFO     25,014 genomes were input to dRep
01-11 12:11 INFO     Calculating genome info of genomes
01-11 12:53 DEBUG    Filtering genomes by size
01-11 12:53 INFO     99.99% of genomes passed length filtering
01-11 12:53 DEBUG    Loading provided genome quality information
01-11 12:53 DEBUG    HERE IS GENOME INFO:
01-11 12:53 DEBUG    
                  genome  completeness  contamination
0  10317.X00185754_13.fa           100              0
1  10317.X00185754_59.fa            98              0
2  10317.X00185754_37.fa            98              0
3  10317.X00185754_24.fa            96              2
4   10317.X00185754_6.fa           100              6
01-11 12:53 DEBUG    There are the columns: ['genome', 'completeness', 'contamination']
01-11 12:54 DEBUG    Missing info on 2415 genomes, running CheckM
01-11 12:54 INFO     Running prodigal
logger.log (END)

@MrOlm MrOlm merged commit dd98b5d into MrOlm:master Jan 11, 2023
@MrOlm
Owner

MrOlm commented Jan 11, 2023

Thanks @tanaes! Your changes passed all my local checks and I merged the branch in. Much appreciated!
