Merge pull request #98 from sanger-tol/clean_params
Overall clean up
muffato authored May 23, 2024

This commit was created on GitHub.com and signed with GitHub’s verified signature.
2 parents acbc472 + 9b1ecc3 commit 6581abf
Showing 16 changed files with 105 additions and 118 deletions.
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,27 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.5.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.5.0)] – Snorlax – []

General tidy-up of the configuration and the pipeline.

### Enhancements & fixes

- Increased the resources for blastn
- Removed some options that were not used or not needed

### Parameters

| Old parameter | New parameter |
| --------------- | ------------- |
| --taxa_file | |
| --blastp_outext | |
| --blastp_cols | |
| --blastx_outext | |
| --blastx_cols | |

> **NB:** Parameter has been **updated** if both old and new parameter information is present. <br> **NB:** Parameter has been **added** if just the new parameter information is present. <br> **NB:** Parameter has been **removed** if new parameter information isn't present.

## [[0.4.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.4.0)] – Buneary – [2024-04-17]

The pipeline has now been validated on dozens of genomes, up to 11 Gbp.
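
The parameter table in the 0.5.0 entry lists five options that no longer exist. As a rough illustration of how an existing parameter set could be audited before upgrading, the Python sketch below flags any of those names. The `find_removed_params` helper and the example dictionary are hypothetical and not part of the pipeline; only the five parameter names come from the changelog.

```python
# Names taken from the 0.5.0 changelog table; everything else is illustrative.
REMOVED_IN_0_5_0 = {
    "taxa_file",
    "blastp_outext",
    "blastp_cols",
    "blastx_outext",
    "blastx_cols",
}


def find_removed_params(params):
    """Return the removed parameter names present in `params`, sorted."""
    return sorted(REMOVED_IN_0_5_0.intersection(params))


if __name__ == "__main__":
    # Hypothetical pre-0.5.0 parameter set.
    old_params = {"fasta": "genome.fa", "taxon": 9606, "taxa_file": "taxa.txt"}
    print(find_removed_params(old_params))
```

Any name reported by such a check would need to be dropped from the command line or params file, since the changelog gives no replacement for these options.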
4 changes: 2 additions & 2 deletions README.md
@@ -20,8 +20,8 @@ It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome
4. Run BUSCO ([`busco`](https://busco.ezlab.org/))
5. Extract BUSCO genes ([`blobtoolkit/extractbuscos`](https://github.com/blobtoolkit/blobtoolkit))
6. Run Diamond BLASTp against extracted BUSCO genes ([`diamond/blastp`](https://github.com/bbuchfink/diamond))
7. Run BLASTn against extracted BUSCO genes ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
8. Run BLASTx against extracted BUSCO genes ([`blast/blastx`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
7. Run BLASTx against sequences with no hit ([`blast/blastx`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
8. Run BLASTn against sequences still with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
9. Count BUSCO genes ([`blobtoolkit/countbuscos`](https://github.com/blobtoolkit/blobtoolkit))
10. Generate combined sequence stats across various window sizes ([`blobtoolkit/windowstats`](https://github.com/blobtoolkit/blobtoolkit))
11. Imports analysis results into a BlobDir dataset ([`blobtoolkit/blobdir`](https://github.com/blobtoolkit/blobtoolkit))
6 changes: 6 additions & 0 deletions conf/base.config
@@ -104,6 +104,12 @@ process {
time = { check_max( 3.h * Math.ceil(meta.genome_size/1000000000) * task.attempt, 'time') }
}

withName: "BLAST_BLASTN" {
cpus = { check_max( 24 * task.attempt, 'cpus' ) }
memory = { check_max( 100.MB * task.attempt, 'memory' ) }
time = { check_max( 12.h * task.attempt, 'time' ) }
}

withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
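
The new `BLAST_BLASTN` block multiplies each base request by `task.attempt`, so a retried task asks for more CPUs, memory and time, and `check_max` keeps the request within the configured ceiling. The Python sketch below mirrors that arithmetic. It assumes `check_max` simply returns the smaller of the request and the limit, and the three ceiling values are made up for illustration; only the base values (24 CPUs, 100 MB, 12 h) come from the hunk above.

```python
def check_max(requested, limit):
    """Stand-in for the pipeline's check_max: cap a request at a limit."""
    return min(requested, limit)


def blastn_request(attempt, max_cpus=64, max_memory_mb=512_000, max_time_h=96):
    """Resources requested by BLAST_BLASTN on a given attempt (1-based).

    The max_* defaults are hypothetical ceilings, not the pipeline's values.
    """
    return {
        "cpus": check_max(24 * attempt, max_cpus),
        "memory_mb": check_max(100 * attempt, max_memory_mb),
        "time_h": check_max(12 * attempt, max_time_h),
    }


if __name__ == "__main__":
    for attempt in (1, 2, 3):
        print(attempt, blastn_request(attempt))
```

With these illustrative ceilings, the CPU request stops growing on the third attempt while memory and time keep scaling.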
13 changes: 10 additions & 3 deletions modules/nf-core/blast/blastn/blast-blastn.diff

Diff not rendered (generated file).

1 change: 0 additions & 1 deletion modules/nf-core/blast/blastn/main.nf

Diff not rendered (generated file).

9 changes: 2 additions & 7 deletions nextflow.config
@@ -17,11 +17,10 @@ params {
mask = false
fetchngs_samplesheet = false

// Reference options
// Reference options
fasta = null
accession = null
taxon = null
taxa_file = null

// Output options
image_format = 'png'
@@ -32,10 +31,6 @@ params {
blastp = null
blastx = null
blastn = null
blastp_outext = 'txt'
blastp_cols = 'qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'
blastx_outext = 'txt'
blastx_cols = 'qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'

// MultiQC options
multiqc_config = null
@@ -248,7 +243,7 @@ manifest {
description = """Quality assessment of genome assemblies"""
mainScript = 'main.nf'
nextflowVersion = '!>=23.04.0'
version = '0.4.0'
version = '0.5.0'
doi = '10.5281/zenodo.7949058'
}

33 changes: 1 addition & 32 deletions nextflow_schema.json
@@ -75,7 +75,7 @@
"type": "object",
"fa_icon": "fas fa-dna",
"description": "Reference genome related files and options required for the workflow.",
"required": ["taxon", "accession", "fasta"],
"required": ["taxon", "fasta"],
"properties": {
"taxon": {
"type": ["string", "integer"],
@@ -102,43 +102,12 @@
"description": "Define the location and parameters to work with databases.",
"required": ["blastp", "blastx", "blastn", "taxdump"],
"properties": {
"taxa_file": {
"type": "string",
"format": "file-path",
"description": "Path to file containing the BUSCO lineages for the genome species",
"help_text": "If this file is not included, the relevant BUSCO lineages are automatically calculated using the taxon parameter.",
"fa_icon": "fas fa-file-alt"
},
"busco": {
"type": "string",
"format": "directory-path",
"description": "Local directory where clade-specific BUSCO lineage datasets are stored",
"fa_icon": "fas fa-folder-open"
},
"blastp_cols": {
"type": "string",
"description": "When blastp_outext is 'txt', this is the list of columns that Diamond BLAST should print.",
"default": "qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
},
"blastp_outext": {
"type": "string",
"enum": ["blast", "xml", "txt", "daa", "sam", "tsv", "paf"],
"description": "Extension (file format) of the output file from Diamond BLAST.",
"fa_icon": "fas fa-file-circle-question",
"default": "txt"
},
"blastx_cols": {
"type": "string",
"description": "When blastx_outext is 'txt', this is the list of columns that Diamond BLAST should print.",
"default": "qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
},
"blastx_outext": {
"type": "string",
"enum": ["blast", "xml", "txt", "daa", "sam", "tsv", "paf"],
"description": "Extension (file format) of the output file from Diamond BLAST.",
"fa_icon": "fas fa-file-circle-question",
"default": "txt"
},
"blastp": {
"type": "string",
"format": "file-path",
4 changes: 2 additions & 2 deletions subworkflows/local/blobtools.nf
@@ -28,14 +28,14 @@ workflow BLOBTOOLS {
ch_versions = ch_versions.mix ( BLOBTOOLKIT_METADATA.out.versions.first() )


//
//
// Create Blobtools dataset files
//
BLOBTOOLKIT_CREATEBLOBDIR ( windowstats, busco, blastp, BLOBTOOLKIT_METADATA.out.yaml, taxdump )
ch_versions = ch_versions.mix ( BLOBTOOLKIT_CREATEBLOBDIR.out.versions.first() )


//
//
// Update Blobtools dataset files
//
BLOBTOOLKIT_UPDATEBLOBDIR ( BLOBTOOLKIT_CREATEBLOBDIR.out.blobdir, blastx, blastn, taxdump )
21 changes: 12 additions & 9 deletions subworkflows/local/busco_diamond_blastp.nf
@@ -12,23 +12,23 @@ include { RESTRUCTUREBUSCODIR } from '../../modules/local/restructurebusco
workflow BUSCO_DIAMOND {
take:
fasta // channel: [ val(meta), path(fasta) ]
taxon_taxa // channel: [ val(meta), val(taxon), path(taxa) ]
taxon // channel: val(taxon)
busco_db // channel: path(busco_db)
blastp // channel: path(blastp_db)
outext // channel: val(out_format)
cols // channel: val(column_names)


main:
ch_versions = Channel.empty()


//
// Fetch BUSCO lineages for taxon (or taxa)
// Fetch BUSCO lineages for taxon
//
GOAT_TAXONSEARCH ( taxon_taxa )
GOAT_TAXONSEARCH (
fasta.combine(taxon).map { meta, fasta, taxon -> [ meta, taxon, [] ] }
)
ch_versions = ch_versions.mix ( GOAT_TAXONSEARCH.out.versions.first() )


//
// Get NCBI species ID
@@ -70,7 +70,7 @@ workflow BUSCO_DIAMOND {
ch_fasta_with_lineage,
"genome",
ch_fasta_with_lineage.map { it[0].lineage_name },
busco_db.collect().ifEmpty([]),
busco_db,
[],
)
ch_versions = ch_versions.mix ( BUSCO.out.versions.first() )
@@ -108,11 +108,14 @@ workflow BUSCO_DIAMOND {

//
// Align BUSCO genes against the BLASTp database
//
//
BLOBTOOLKIT_EXTRACTBUSCOS.out.genes
| filter { it[1].size() > 140 }
| set { ch_busco_genes }

// Hardcoded to match the format expected by blobtools
def outext = 'txt'
def cols = 'qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'
DIAMOND_BLASTP ( ch_busco_genes, blastp, outext, cols )
ch_versions = ch_versions.mix ( DIAMOND_BLASTP.out.versions.first() )

@@ -141,7 +144,7 @@ workflow BUSCO_DIAMOND {


emit:
first_table = ch_first_table // channel: [ val(meta), path(full_table) ]
first_table = ch_first_table // channel: [ val(meta), path(full_table) ]
all_tables = ch_indexed_buscos // channel: [ val(meta), path(full_tables) ]
blastp_txt = DIAMOND_BLASTP.out.txt // channel: [ val(meta), path(txt) ]
taxon_id = ch_taxid // channel: taxon_id
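
The column list hardcoded for `DIAMOND_BLASTP` in this file names 15 fields, with `qseqid` and `bitscore` each appearing twice. A consumer that loads a row into a plain dictionary keyed by column name would silently collapse those duplicates, so the Python sketch below keeps the fields as ordered pairs instead. The `parse_hit` helper and the sample row are hypothetical; only the column string comes from the diff, and the sketch assumes tab-separated output.

```python
# Column string copied from the subworkflow; note the repeated names.
COLS = (
    "qseqid staxids bitscore qseqid sseqid pident length mismatch "
    "gapopen qstart qend sstart send evalue bitscore"
).split()


def parse_hit(line):
    """Split one tab-separated row into (column, value) pairs, in order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLS):
        raise ValueError(f"expected {len(COLS)} fields, got {len(fields)}")
    return list(zip(COLS, fields))


if __name__ == "__main__":
    # Made-up row, for illustration only.
    row = "\t".join(
        ["ctg1", "9606", "250", "ctg1", "P12345", "98.5", "120", "2",
         "0", "1", "120", "5", "124", "1e-50", "250"]
    )
    for name, value in parse_hit(row):
        print(name, value)
```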
2 changes: 1 addition & 1 deletion subworkflows/local/collate_stats.nf
@@ -8,7 +8,7 @@ include { BLOBTOOLKIT_WINDOWSTATS } from '../../modules/local/blobtoolkit/window


workflow COLLATE_STATS {
take:
take:
busco // channel: [ val(meta), path(full_table) ]
bed // channel: [ val(meta), path(bed) ]
freq // channel: [ val(meta), path(freq) ]
6 changes: 3 additions & 3 deletions subworkflows/local/coverage_stats.nf
@@ -10,8 +10,8 @@ include { CREATE_BED } from '../../modules/local/create_bed'


workflow COVERAGE_STATS {
take:
input // channel: [ val(meta), path(aln) ]
take:
input // channel: [ val(meta), path(aln) ]
fasta // channel: [ val(meta), path(fasta) ]


@@ -57,7 +57,7 @@ workflow COVERAGE_STATS {
CREATE_BED ( FASTAWINDOWS.out.mononuc )
ch_versions = ch_versions.mix ( CREATE_BED.out.versions.first() )


// Calculate coverage
BLOBTK_DEPTH ( ch_bam_csi )
ch_versions = ch_versions.mix ( BLOBTK_DEPTH.out.versions.first() )
4 changes: 2 additions & 2 deletions subworkflows/local/minimap_alignment.nf
@@ -1,4 +1,4 @@
//
//
// Optional alignment subworkflow using Minimap2
//

@@ -52,7 +52,7 @@ workflow MINIMAP2_ALIGNMENT {
// Align with Minimap2
MINIMAP2_HIC ( ch_input.hic, fasta, true, false, false )
ch_versions = ch_versions.mix(MINIMAP2_HIC.out.versions.first())

MINIMAP2_ILMN ( ch_input.illumina, fasta, true, false, false )
ch_versions = ch_versions.mix(MINIMAP2_ILMN.out.versions.first())

2 changes: 1 addition & 1 deletion subworkflows/local/prepare_genome.nf
@@ -48,7 +48,7 @@ workflow PREPARE_GENOME {
ch_fasta = ch_genome
}


emit:
genome = ch_fasta // channel: [ meta, path(genome) ]
versions = ch_versions // channel: [ versions.yml ]
14 changes: 7 additions & 7 deletions subworkflows/local/run_blastn.nf
@@ -12,8 +12,8 @@ include { BLOBTOOLKIT_UNCHUNK } from '../../modules/local/blobtoolkit/u


workflow RUN_BLASTN {
take:
blast_table // channel: [ val(meta), path(blast_table) ]
take:
blast_table // channel: [ val(meta), path(blast_table) ]
fasta // channel: [ val(meta), path(fasta) ]
blastn // channel: [ val(meta), path(blastn_db) ]
taxon_id // channel: val(taxon_id)
@@ -27,16 +27,16 @@ workflow RUN_BLASTN {
// Get list of sequence ids with no hits in diamond blastx search
NOHIT_LIST ( blast_table, fasta )
ch_versions = ch_versions.mix ( NOHIT_LIST.out.versions.first() )

// Subset of sequences with no hits
SEQTK_SUBSEQ (
fasta,
NOHIT_LIST.out.nohitlist.map { meta, nohit -> nohit }
NOHIT_LIST.out.nohitlist.map { meta, nohit -> nohit } . filter { it.size() > 0 }
)
ch_versions = ch_versions.mix ( SEQTK_SUBSEQ.out.versions.first() )
// Split long contigs into chunks


// Split long contigs into chunks
// create chunks
BLOBTOOLKIT_CHUNK ( SEQTK_SUBSEQ.out.sequences, [[],[]] )
ch_versions = ch_versions.mix ( BLOBTOOLKIT_CHUNK.out.versions.first() )
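
The `filter { it.size() > 0 }` added to the no-hit list above appears intended to stop the subsetting step from running when every sequence already has a hit. The Python sketch below shows the same idea on ordinary files, assuming `size()` refers to the file size in bytes. The `keep_non_empty` helper and the temporary files are hypothetical and only illustrate the filtering step.

```python
import os
import tempfile


def keep_non_empty(paths):
    """Keep only files whose size on disk is greater than zero bytes."""
    return [p for p in paths if os.path.getsize(p) > 0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        empty = os.path.join(tmp, "empty.nohit.txt")
        filled = os.path.join(tmp, "filled.nohit.txt")
        open(empty, "w").close()          # zero-byte no-hit list
        with open(filled, "w") as handle:
            handle.write("ctg7\n")        # one sequence without a hit
        print(keep_non_empty([empty, filled]))
```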
7 changes: 4 additions & 3 deletions subworkflows/local/run_blastx.nf
@@ -11,8 +11,6 @@ workflow RUN_BLASTX {
fasta // channel: [ val(meta), path(fasta) ]
table // channel: [ val(meta), path(busco_table) ]
blastx // channel: [ val(meta), path(blastx_db) ]
outext // channel: val(out_format)
cols // channel: val(column_names)


main:
@@ -29,9 +27,12 @@ workflow RUN_BLASTX {
//
// Run diamond_blastx
//
// Hardcoded to match the format expected by blobtools
def outext = 'txt'
def cols = 'qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'
DIAMOND_BLASTX ( BLOBTOOLKIT_CHUNK.out.chunks, blastx, outext, cols)
ch_versions = ch_versions.mix ( DIAMOND_BLASTX.out.versions.first() )


//
// Unchunk chunked blastx results
76 changes: 31 additions & 45 deletions workflows/blobtoolkit.nf
@@ -17,22 +17,24 @@ WorkflowBlobtoolkit.initialise(params, log)

// Add all file path parameters for the pipeline to the list below
// Check input path parameters to see if they exist
def checkPathParamList = [ params.input, params.multiqc_config, params.fasta, params.taxa_file, params.taxdump, params.busco, params.blastp, params.blastx ]
def checkPathParamList = [ params.input, params.multiqc_config, params.fasta, params.taxdump, params.busco, params.blastp, params.blastx ]
for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } }

// Check mandatory parameters
if (params.input) { ch_input = file(params.input) } else { exit 1, 'Input samplesheet not specified!' }
if (params.fasta && params.accession) { ch_fasta = Channel.of([ [ 'id': params.accession ], params.fasta ]).first() } else { exit 1, 'Genome fasta file and accession must be specified!' }
if (params.taxon) { ch_taxon = Channel.of(params.taxon) } else { exit 1, 'NCBI Taxon ID not specified!' }
if (params.blastp && params.accession) { ch_blastp = Channel.of([ [ 'id': params.accession ], params.blastp ]).first() } else { exit 1, 'Diamond BLASTp database and accession must be specified!' }
if (params.blastx && params.accession) { ch_blastx = Channel.of([ [ 'id': params.accession ], params.blastx ]).first() } else { exit 1, 'Diamond BLASTx database and accession must be specified!' }
if (params.blastn && params.accession) { ch_blastn = Channel.of([ [ 'id': params.accession ], params.blastn ]).first() } else { exit 1, 'BLASTn database not specified!' }
if (params.fasta) { ch_fasta = Channel.value([ [ 'id': params.accession ?: file(params.fasta.replace(".gz", "")).baseName ], file(params.fasta) ]) } else { exit 1, 'Genome fasta file must be specified!' }
if (params.taxon) { ch_taxon = Channel.value(params.taxon) } else { exit 1, 'NCBI Taxon ID not specified!' }
if (params.blastp) { ch_blastp = Channel.value([ [ 'id': file(params.blastp).baseName ], params.blastp ]) } else { exit 1, 'Diamond BLASTp database must be specified!' }
if (params.blastx) { ch_blastx = Channel.value([ [ 'id': file(params.blastx).baseName ], params.blastx ]) } else { exit 1, 'Diamond BLASTx database must be specified!' }
if (params.blastn) { ch_blastn = Channel.value([ [ 'id': file(params.blastn).baseName ], params.blastn ]) } else { exit 1, 'BLASTn database not specified!' }
if (params.taxdump) { ch_taxdump = file(params.taxdump) } else { exit 1, 'NCBI Taxonomy database not specified!' }
if (params.fetchngs_samplesheet && !params.align) { exit 1, '--align not specified, even though the input samplesheet is a nf-core/fetchngs one - i.e has fastq files!' }

// Create channel for optional parameters
if (params.busco) { ch_busco_db = Channel.fromPath(params.busco) } else { ch_busco_db = Channel.empty() }
if (params.yaml && params.accession) { ch_yaml = Channel.of([ [ 'id': params.accession ], params.yaml ]) } else { ch_yaml = Channel.empty() }
if (params.busco) { ch_busco_db = Channel.fromPath(params.busco).first() } else { ch_busco_db = Channel.value([]) }
if (params.yaml) { ch_yaml = Channel.fromPath(params.yaml) } else { ch_yaml = Channel.empty() }
if (params.yaml && params.accession) { exit 1, '--yaml cannot be provided at the same time as --accession !' }
if (!params.yaml && !params.accession) { exit 1, '--yaml and --accession are both missing. Pick one !' }
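
The rewritten checks above change two behaviours: the genome `id` now falls back to the FASTA file name when `--accession` is absent, and exactly one of `--yaml` and `--accession` must be supplied. The Python sketch below restates both rules. The function names are hypothetical, and `Path.stem` is used as an assumed equivalent of Nextflow's `baseName` (the file name without its last extension). As in the Groovy code, `replace(".gz", "")` removes every occurrence of `.gz` in the path, not only a trailing one.

```python
from pathlib import Path


def genome_id(fasta, accession=None):
    """Use the accession if given, else the FASTA name minus .gz and extension."""
    if accession:
        return accession
    return Path(fasta.replace(".gz", "")).stem


def check_yaml_accession(yaml=None, accession=None):
    """Require exactly one of yaml and accession, as the new checks do."""
    if yaml and accession:
        raise ValueError("--yaml cannot be provided at the same time as --accession")
    if not yaml and not accession:
        raise ValueError("--yaml and --accession are both missing")


if __name__ == "__main__":
    print(genome_id("/data/assembly.fa.gz"))
    print(genome_id("/data/assembly.fa.gz", accession="GCA_000000000.1"))
    check_yaml_accession(accession="GCA_000000000.1")
```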

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -51,11 +53,6 @@ ch_multiqc_custom_methods_description = params.multiqc_methods_description ? fil
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

//
// MODULE: Loaded from modules/local/
//
include { BLOBTOOLKIT_CONFIG } from '../modules/local/blobtoolkit/config'

//
// SUBWORKFLOW: Consisting of a mix of local and nf-core/modules
//
@@ -108,7 +105,7 @@ workflow BLOBTOOLKIT {
INPUT_CHECK ( ch_input, PREPARE_GENOME.out.genome, ch_yaml )
ch_versions = ch_versions.mix ( INPUT_CHECK.out.versions )

//
//
// SUBWORKFLOW: Optional read alignment
//
if ( params.align ) {
@@ -120,70 +117,59 @@ workflow BLOBTOOLKIT {
}

//
// SUBWORKFLOW: Calculate genome coverage and statistics
// SUBWORKFLOW: Calculate genome coverage and statistics
//
COVERAGE_STATS ( ch_aligned, PREPARE_GENOME.out.genome )
ch_versions = ch_versions.mix ( COVERAGE_STATS.out.versions )

//
// SUBWORKFLOW: Run BUSCO using lineages fetched from GOAT, then run diamond_blastp
//
if (params.taxa_file) {
ch_taxa = Channel.from(params.taxa_file)
ch_taxon_taxa = PREPARE_GENOME.out.genome.combine(ch_taxon).combine(ch_taxa).map { meta, fasta, taxon, taxa -> [ meta, taxon, taxa ] }
} else {
ch_taxon_taxa = PREPARE_GENOME.out.genome.combine(ch_taxon).map { meta, fasta, taxon -> [ meta, taxon, [] ] }
}

BUSCO_DIAMOND (
PREPARE_GENOME.out.genome,
ch_taxon_taxa,
ch_busco_db,
ch_blastp,
params.blastp_outext,
params.blastp_cols
BUSCO_DIAMOND (
PREPARE_GENOME.out.genome,
ch_taxon,
ch_busco_db,
ch_blastp,
)
ch_versions = ch_versions.mix ( BUSCO_DIAMOND.out.versions )

//
// SUBWORKFLOW: Diamond blastx search of assembly contigs against the UniProt reference proteomes
//
RUN_BLASTX (
RUN_BLASTX (
PREPARE_GENOME.out.genome,
BUSCO_DIAMOND.out.first_table,
ch_blastx,
params.blastx_outext,
params.blastx_cols
)
ch_versions = ch_versions.mix ( RUN_BLASTX.out.versions )


//
// SUBWORKFLOW: Run blastn search on sequences that had no blastx hits
//
RUN_BLASTN (
RUN_BLASTX.out.blastx_out,
PREPARE_GENOME.out.genome,
ch_blastn,
RUN_BLASTN (
RUN_BLASTX.out.blastx_out,
PREPARE_GENOME.out.genome,
ch_blastn,
BUSCO_DIAMOND.out.taxon_id
)

//
// SUBWORKFLOW: Collate genome statistics by various window sizes
//
COLLATE_STATS (
COLLATE_STATS (
BUSCO_DIAMOND.out.all_tables,
COVERAGE_STATS.out.bed,
COVERAGE_STATS.out.freq,
COVERAGE_STATS.out.mononuc,
COVERAGE_STATS.out.cov
COVERAGE_STATS.out.bed,
COVERAGE_STATS.out.freq,
COVERAGE_STATS.out.mononuc,
COVERAGE_STATS.out.cov
)
ch_versions = ch_versions.mix ( COLLATE_STATS.out.versions )

//
// SUBWORKFLOW: Create BlobTools dataset
//
BLOBTOOLS (
BLOBTOOLS (
INPUT_CHECK.out.config,
COLLATE_STATS.out.window_tsv,
BUSCO_DIAMOND.out.all_tables,
@@ -193,7 +179,7 @@ workflow BLOBTOOLKIT {
ch_taxdump
)
ch_versions = ch_versions.mix ( BLOBTOOLS.out.versions )

//
// SUBWORKFLOW: Generate summary and static images
//
